First Version of Mapping Sound to Gestures and Emotions


Project Title: i-treasures: Intangible Treasures Capturing the Intangible Cultural Heritage and Learning the Rare Know-How of Living Human Treasures
Contract No: FP7-ICT
Instrument: Large Scale Integrated Project (IP)
Thematic Priority: ICT for access to cultural resources
Start of project: 1 February 2013
Duration: 48 months

Deliverable D4.2
First Version of Mapping Sound to Gestures and Emotions

Due date of deliverable: 31 January 2015
Actual submission date: 16 March 2015
Version: 2nd version of D4.2
Main Authors: Edgar Hemery (ARMINES), Christina Volioti (UOM)

Project funded by the European Community under the 7th Framework Programme for Research and Technological Development.

Project ref. number: ICT
Project title: i-treasures - Intangible Treasures Capturing the Intangible Cultural Heritage and Learning the Rare Know-How of Living Human Treasures
Deliverable title: First Version of Mapping Sound to Gestures and Emotions
Deliverable number: D4.2
Deliverable version: Version 2
Previous version(s): Version 1
Contractual date of delivery: 31 January 2015
Actual date of delivery: 16 March 2015
Deliverable filename: d4.2_fv4
Nature of deliverable: R
Dissemination level: PU
Number of pages: 65
Workpackage: WP 4
Partner responsible: ARMINES
Author(s): Edgar Hemery (ARMINES), Christina Volioti (UOM), Athanasios Manitsaris (UOM), Raoul De Charrette (UOM), Viki Tsekouropoulou (ARMINES), Stelios Hadjidimitriou (AUTH), Vasileios Charisis (AUTH), Leontios Hadjileontiadis (AUTH), Joelle Tilmanne (UMONS)
Editor: Edgar Hemery (ARMINES), Christina Volioti (UOM)
EC Project Officer: Alina Senn

Abstract: This deliverable presents the methodology for mapping sound to gestures, taking into consideration the emotional status of the performer. In this first version of the mapping sound to gestures and emotions module, a multimodal data fusion has been performed based on different and heterogeneous sensors that have been integrated into a unified interface for gesture and emotion recognition, mapping and synthesis.

Keywords: Mapping, sound, gesture, emotional status, data fusion, synthesis

Signatures (Responsibility - Company, Date):
Written by: Edgar Hemery (ARMINES), Christina Volioti (UOM), Responsible for D4.2 (ARMINES), 6/2/2015
Verified by: Sotiris Manitsaris, Task 4.3 leader (ARMINES), 20/2/2015; Yiannis Kompatsiaris, WP4 leader (CERTH), 15/3/2015
Approved by: Nikos Grammalidis, Coordinator (CERTH), 16/3/2015; Kosmas Dimitropoulos, Quality Manager (CERTH), 16/3/2015

Table of Contents

1. Executive summary
2. Introduction
   Background and ambition
   Aim of the deliverable
3. State of the art
   Musical gestures and emotions as Intangible Cultural Heritage
      Musical gestures as referential patterns of composers
      Gestural knowledge of the performer and acquisition from the learner
   Emotional expression in music
      Emotions in musical performance
      Emotions in musical composition
   Gesture control of sound
      Explicit mapping of gestures to sound
      Implicit mapping of gestures to sound
   Sound synthesis engines
   New interfaces for musical expression
      MIDI controllers
      Physiological signals
      Gesture capture
   Conclusions from the state of the art
4. Objectives beyond the state of the art
5. Methodological approach
   Methodology of mapping sound to gestures and emotions
   Gesture Vocabulary and hierarchical metaphors on the Intangible Musical Instrument
   Composing with gestures
   Preliminary analysis of inherited knowledge from the composer: The paradigm of Beethoven's sonatas as Cultural Heritage
      Musicological analysis of "Waldstein" and "The Tempest" sonatas
      Defining emotional normative rating for the sonatas
         Beethoven's Waldstein Sonata
         Beethoven's The Tempest Sonata
         Analysis of results
6. Overview of the Intangible Musical Instrument
   Aim of the Intangible Musical Instrument
   Capturing of the upper body gestures including fingers
      Overview of the sensors and structural details
      Descriptors used for gesture capture
   Multimodal data fusion
   Capturing and recognition of the emotional status
   Mapping strategies for Intangible Musical Instrument
      Explicit mapping based on finger gesture recognition
      Implicit mapping based on head, arms and vertebral axis gesture recognition
   Sound synthesis taking into consideration the emotional status
7. Technical implementation and software development
   Prototyping the Intangible Musical Instrument
   Architecture of the Intangible Instrument
   Unified interface for gesture and emotion recognition, mapping and synthesis
      Detailed description of the unified interface
   Game-like application
8. Ongoing and future work
   The augmented musical score based on gestural and emotional notation
   Hierarchical Mapping and Hybrid sound Synthesis
   Proposition of prototyping the IMI
9. Conclusions
Appendix I. Intangible Musical Instrument and Musical Education Game
Appendix II. Additional information for sound synthesis engines
   Additive Synthesis applications
   Granular synthesis applications
   FM Synthesis applications
Appendix III. Multimodal data fusion
   Technical generalization of a sensor client
   Technical details
References

List of Abbreviations

CMC: Contemporary Music Composition
ICH: Intangible Cultural Heritage
IMI: Intangible Musical Instrument
NUI: Natural User Interface
CH: Cultural Heritage
TCH: Tangible Cultural Heritage
HMM: Hidden Markov Models
ADSR: Attack Decay Sustain Release
IRCAM: Institut de Recherche et Coordination Acoustique/Musique
NIME: New Interfaces for Musical Expression
EMG: Electromyography
OSC: Open Sound Control
SAM: Self-Assessment Manikin
AMOS: Analysis of Moment Structures
SVD: Singular Value Decomposition
UDP: User Datagram Protocol
RPAS: Rest-Preparation-Attack-Sustain
NTP: Network Time Protocol
CNMAT: Center for New Music and Audio Technologies
DTW: Dynamic Time Warping

1. Executive summary

This deliverable is the result of the first phase of Task 4.3 "Mapping Sound to Gestures and Emotions" (month 15 to month 24). It presents the methodology for mapping sound to gestures, taking into consideration the emotional status of the performer. This task is related to the Contemporary Music Composition (CMC) use case. It develops a prototype Natural User Interface (NUI), named the Intangible Musical Instrument (IMI), which aims to facilitate access to the knowledge of the performers that constitutes musical Intangible Cultural Heritage (ICH), using democratized technologies that are easily accessible to the general public. More specifically, this prototype is able to capture, model and recognize musical gestures (upper body including fingers) as well as to map them to sound. The emotional status of the performer (expert or learner) impacts the sound parameters at the synthesis level. In this version, the research effort mainly focused on the definition of the emotional normative ratings, which is a critical step to determine the level of granularity of the synthesis parameters. Moreover, the IMI aims to contribute to making performing, learning and composing with gestures a first-person experience for beginners.

In the first phase of Task 4.3, a special focus has been given to the design of a unified concept for learning, performing and composing with gestures and emotions. The most difficult part of the work done so far has been the conceptualisation of a unified approach (learning, performing and composing with gestures) to a multi-layer ICH, defined both by the composer and the expert performer (for the purposes of this deliverable, the composer is the creator of a musical score and the performer is a musician who interprets the music score on a musical instrument and who can be either an expert, i.e. a holder of the ICH, or a learner). Another very important challenge that this task had to deal with is the multimodality of the musical performance, including both gestures (upper part of the body including fingers) and emotions. Therefore, musical gestures and emotional status are not only recognized, but also mapped and synthesized to sound.

Within Task 4.3, significant effort has been made on the definition of emotional normative ratings for the sonatas «Waldstein» and «The Tempest» of Beethoven. For this purpose, a morphological analysis has been carried out on these sonatas and referential patterns of Beethoven have been defined. Appropriate interactive questionnaires have been designed to create a statistical normative rating that describes the emotional status when an individual listens to musical excerpts that constitute an ICH. Based on these ratings, emotional annotations for the same referential patterns of Beethoven will be created and used to augment the inherited music score.

Task 4.3 is partially based on algorithms and modules developed in Tasks 3.2 (body and gesture recognition) and 3.3 (electroencephalography analysis). Nevertheless, modifications had to be made to these algorithms, not only because of the data fusion but also because of the specific needs of this use case. A very important effort has been made to solve various integration issues related to the variety of sensors, operating systems, third-party controller tools, etc. From a technical point of view, the IMI is designed to get input from motion and brain-signal sensors such as the inertial sensors of Animazoo, the Kinect, the Leap Motion and the Emotiv.
In this first version of the mapping sound to gestures and emotions module, a multimodal data fusion has been performed based on different and heterogeneous sensors that have been integrated into a unified interface for gesture and emotion recognition, mapping and synthesis. The prototype of the IMI is optionally supported by a table-like construction made of Plexiglas and metal, on which the user can rest his/her hands at a comfortable height. Finally, future work in this task includes the design of a visualisation module named the augmented music score, which will be integrated in the 3D platform and whose goal will be to facilitate access to the knowledge of the expert performer, both gestural and emotional.

2. Introduction

2.1. Background and ambition

According to UNESCO, music is the most universal form of the performing arts, since it can be found in every society, usually as an integral part of other performing art forms and other domains of the Intangible Cultural Heritage (ICH). Music can be found in a large variety of contexts: classical, contemporary, popular, sacred, etc. Instruments, artefacts and objects in general are closely linked with musical expressions and they are all included in the Convention's definition of the ICH. Music that fits the western form of notation is better protected; music that does not fit western notation is usually threatened. In any case, the key point for all music forms is to have access to the gestural knowledge of playing a musical instrument and to strengthen the bond between the expert holder of the ICH, who is the performer, and the learner.

Motivated by this need, researchers have in recent years focused on the study of embodiment and enactive concepts. These concepts reflect the contribution of body movement to action/perception and to the mind/environment interaction [80]. In performing arts, and more precisely in music, body movement is semantically connected with gesture in most activities, such as performing and composing.

On the one hand, composers bring together knowledge and skills in sound colouring and organisation in terms of structure and form, as well as idiomatic gestures in musical and physical playing that lead to the organisation of the material, culminating in compositional structure. This always brings to the surface the question "how does this work?". According to various viewpoints, such as those of Allen Forte, Arnold Whittall, Rosemary Killam and Patrick McCreless, "the theory explains the facts" [106]. Leveraging on this, theory cannot be considered the ultimate means of accessing musical knowledge: theory can explain how a piece of music works, but not how the composer or the performer actually functions.

On the other hand, performance is the result of the symbiosis between the musician and his/her instrument. This symbiosis takes the form of an interactional and gravitational relationship, where the musician is both a trigger and a transmitter connecting:
perception (mediated instrumental mechanisms and physical environment),
knowledge (inherited music score of the composer, music theory, etc.),
gesture (semantic body movements encapsulating functional knowledge).
Consequently, the expert musical gesture can be considered as a fully embodied notion that encapsulates the knowledge of the performer to produce sounds and interpret music pieces, following the musical notation defined by the composer. Moreover, the musical instrument is a tangible interface that can be considered as a means of musical expression and performance. Nevertheless, the learning curve of playing a musical instrument is long: it requires years of training, practice and apprenticeship before one is able to perform. Furthermore, the learning of expert musical gestures is still experienced from a second-person perspective, that is, as a communicative act of social interaction rather than as «my own» personal experience. Consequently, «learning» musical gestures and «performing» music are usually perceived as separate concepts and experiences. This means that access to knowledge is a long-term procedure, since there is no quick transition from novice to expert.
Leveraging on the above concepts and taking into consideration the context of the Contemporary Music Composition use case, Task 4.3 "Mapping sound to gestures and emotions" is based on the capture, modelling and recognition of the gestures of expert performers and aims to conceptualise and develop a unified Natural User Interface (NUI), named the Intangible Musical Instrument (IMI), that supports as a unified user experience both:
1. the learning of expert musical gestures (Transmission) and
2. the performing and composing with gestures (Preservation and Renewal).

All the visualisation aspects of the NUI are implemented in Task 5.2 "3D Visualization Module for Sensorimotor Learning".

2.2. Aim of the deliverable

This deliverable, named D4.2 "First Version of Mapping Sound to Gestures and Emotions", is the main outcome of the first phase of Task 4.3, which is part of WP4 "Data fusion and semantic analysis". It presents the objectives and the current status of the mapping engine integrated into the Intangible Musical Instrument, which conceptually correlates gestures and sound, taking into consideration the emotional status of the performer.

3. State of the art

3.1. Musical gestures and emotions as Intangible Cultural Heritage

Musical gestures as referential patterns of composers

Gesture is the core activity of music creation; a dynamic organism, similar to the human organism; an experience that combines structural properties of music together with cultural and historical contexts [53][84][85][86][87][88]. When talking about musical gestures and Cultural Heritage (CH), there is an endless list of composers and knowledge that constitute an ICH. Since an exhaustive survey is not the goal of this deliverable, only some representative examples of classical composers are given. For example, short musical patterns, which can be easily imitated through body gestures, constitute for Beethoven the palette of his compositions. These short patterns, and their variations, constitute an ongoing unfolding process throughout his musical pieces. Many analysts consider this practice a self-referential context, where musical gestures, similar to other variations of the same gesture, are recognized within the same piece. Another example is the musical collage of gestures in Berio's Sinfonia. Gestural patterns of Mahler, Ravel and Debussy are integrated into the new musical piece, remaining representative musical idioms that transmit music-related cultural meanings.

Consequently, «musical patterns» and «gestural patterns in music» are closely linked notions, since sonic forms are understood through embodiment. These patterns constitute elements of social interaction and differentiation, since their imitation entails the acquisition of cultural models for emulation. According to McNeill [79], these patterns can be considered, throughout history, as playing an important role in creating and sustaining human communities, and they can be understood as a mirror system between composer and listener or even master and learner [81][82]. Nevertheless, musical pieces documented through musical scores, which constitute a Tangible Cultural Heritage (TCH), encapsulate only abstract information about the energy and expressivity of gestures, which are finally incarnated through the interpretation of performers on musical instruments (Figure 1).

Figure 1. Composer's knowledge (first ICH layer)

Effective, accompanist and figurative gestures

The gesture vocabulary described by Delalande, which has been extensively used in the literature ([52], [53], [54]), splits the musical gesture into three classes:
1. Effective gesture: necessary to mechanically produce the sound (e.g. pressing a key).
2. Accompanist gesture: body movement associated with the effective movement; it is as related to imagination as to the effective production of the sound.
3. Figurative gesture: not related to any sound-producing movement; it conveys a symbolic message.
This description is interesting since it is more specific to the instrumental gesture than the previous one, and it gives us a more detailed categorization for analysing musical gestures.

Gestural knowledge of the performer and acquisition from the learner

The examples documented in the previous section show that the musical meaning of gestural knowledge involves different levels of information, which are: a) the first-person, b) the second-person and c) the third-person perspective on gesture.

The first-person perspective on gesture defines the meaning of the gesture for the person who actually performs it. Within the ICH context, expert performers are holders of ICH who have perfected their knowledge to include high-level specific characteristics. Additionally, the learner can also have a first-person perspective when playing a musical instrument. The difference between them is that the expert has developed his/her action-based approach to gesture at a greater level than the learner (depending, of course, on the level of the learner), because s/he knows all the gestural patterns in music. S/he has mental access to how the action described on the musical score is deployed

over time, and s/he has the capacity to control his/her sensorimotor system that produces the corresponding sonic form.

The second-person perspective on gesture refers to how other people perceive the musical gesture in a social interaction context. This approach is the most typical one used in music schools, conservatories, etc. The learner observes the expert, who in most cases is his/her teacher, following the concept of «my» perception of «your» gesture [89]. According to this «me-to-you» relationship, a mirroring system is established between expert and learner, where the body movements of the learner are deployed so that the movement of the expert, incorporating the knowledge of the composer derived from the score, is understood as an action by the learner.

The third-person perspective on gesture focuses on the measurement and capturing of moving objects. This task can be done by a computer using audio recording, video recording, motion capture technologies and brain scans, as well as physiological body changes [89][90][91]. In this way, the knowledge of the performer is captured based on techniques of feature extraction and pattern matching.

Figure 2. Performer's knowledge (second ICH layer)

3.2. Emotional expression in music

There is no need for scientific evidence to support the fact that music expresses emotions, as personal affective experiences during music listening are more than enough. However, a vast amount of research has been conducted in order to reveal further insights into this phenomenon, ranging from philosophical to biological approaches [67]. It has been suggested that such music-induced emotions are governed by universality in terms of musical culture, meaning that listeners with different cultural backgrounds can infer emotions in culture-specific music to a certain extent. Such evidence led to the assumption that the neurobiological functions underlying such emotional

experiences do not differ across members of different cultures, as the neural networks responsible may be fixed. In general, the processing of musical stimuli involves the gradual analysis of music structural elements, from basic acoustic features to musical syntax, that leads to the perception of the emotions and semantic meanings underlying the stimuli [68]. It is becoming evident that the structure of music defines what it expresses. To be more accurate, music does not literally express emotion, as it is not a sentient creature; rather, its structural elements and production performance, which shape the acoustic outcome, foster the induction of emotional states in the listener, who is indeed a sentient being. In the following, affective states will often be characterised based on the valence-arousal model [69]: valence denotes whether an emotion is positive or negative, while arousal refers to the level of excitation that the emotion encapsulates.

Emotions in musical performance

Written music can be performed in different ways, just as a piece of text can be read with various tones. In an important sense, music exists only when it is performed, and performances of the same work can differ significantly. The latter forms the concept of performance expression, which refers to both (a) the correlation between the performer's interpretation of a musical excerpt and the small-scale variations in timing, dynamics, vibrato and articulation that shape the microstructure of the performance, and (b) the relationship between such variations and the listener's perception of the performance. It has been proposed that performance expression emerges from five different sources, i.e., Generative rules, Emotion expression, Random fluctuations, Motion principles, and Stylistic unexpectedness, referred to as the GERMS model [70]. Here, the focus is placed on emotional expression, which allows the performer to convey emotions to listeners by manipulating features such as tempo and loudness in order to render the performance with the emotional characteristics that seem suitable for the particular musical piece.

The primary acoustic cues of emotional expression in music performance, along with the perceived correlates in parentheses, are considered to be the following [71]:
Pitch: fundamental frequency (pitch), fundamental frequency contour (intonation contour), and vibrato (vibrato).
Intensity: intensity (loudness) and attack (rapidity of tone onsets).
Temporal features: tempo (velocity of music), articulation (proportion of sound to silence in successive notes), and timing (tempo and rhythm variation).
Timbre: high-frequency energy (timbre), the singer's formant (timbre).

Manipulation of such cues has been found to be related to certain affective states. Such relationships with five basic emotions are presented below [71]:
Happiness: fast mean tempo, small tempo variability, staccato articulation, large articulation variability, high sound level, little sound level variability, bright timbre, fast tone attacks, small timing variations, sharp duration contrasts, rising micro-intonation.
Tenderness: slow mean tempo, slow tone attacks, low sound level, small sound level variability, legato articulation, soft timbre, large timing variations, accents on stable notes, soft duration contrasts, final ritardando.
Sadness: slow mean tempo, legato articulation, small articulation variability, low sound level, dull timbre, large timing variations, soft duration contrasts, slow tone attacks, flat micro-intonation, slow vibrato, final ritardando.
Fear: staccato articulation, very low sound level, large sound level variability, fast mean tempo, large tempo variability, large timing variations, soft spectrum, sharp micro-intonation, fast, shallow, irregular vibrato.

Anger: high sound level, sharp timbre, spectral noise, fast mean tempo, small tempo variability, staccato articulation, abrupt tone attacks, sharp duration contrasts, accents on unstable notes, large vibrato extent, no ritardando.

Moreover, another, structure-related means of expressing emotions in performance is ornamentation, i.e., notes that embellish a melodic line and whose presence or absence does not influence the underlying melodic structure. From the aforementioned, it is becoming evident that there is no one-to-one mapping of cues to emotions, as, for example, slow mean tempo is related both to tenderness and to sadness. As a consequence, performers have to employ many cues in order for the emotion communication to be more effective: the larger the number of expressive cues used, even if some of them are redundant, the more effective the communication of emotion. From the above description it is clearly conceivable that emotions play a significant role in musical artistic expression. Consequently, the analysis and manipulation of users' affective states should be taken into serious consideration within the design, development and practice of the IMI, which aims to support performing «music».

Emotions in musical composition

Emotion and music, however, start to interweave at the genesis of a musical work, that is, its composition: "A composer... knows the forms of emotions and can handle them, 'compose' them" [72]. Musical structure-related factors that contribute to emotional expression are usually represented by designations in conventional musical notation, such as tempo markings, dynamic markings, pitch, intervals, mode, melody, rhythm, harmony, and various formal properties (e.g. repetition, variation, transposition). An epitomised reference of the correlation of such factors with emotional expression is given below, based on the review of Gabrielsson and Lindström (2010) [73]:

Tempo, note density: As mentioned above, fast tempo is associated with high emotional arousal, while slow tempo is associated with low arousal. Both of them may be related to either positive or negative emotional valence. Moreover, high note density, i.e., the number of notes per time unit, is linked to expressions of higher arousal.
Mode, key: Major mode is often associated with positive valence, while minor mode is related to negative valence. However, emotional expression is highly dependent on the context, and it is sometimes irrespective of mode.
Intervals: Large intervals are considered more arousing than small ones; the octave expresses positive valence, while the minor second is the most "sad" interval.
Melody: Wide melodic range is associated with high arousal, while narrow range expresses low arousal feelings. Melodic direction and contour, on the other hand, play no significant role in emotional expression. Stepwise melodic motion may suggest low arousal. Moreover, melodies including regular occurrences of perfect fourths, minor sevenths, and no or few tritones are considered pleasant. On the other hand, melodies with a greater occurrence of minor seconds, tritones, and intervals larger than the octave express high arousal, while potency is often conveyed by the regular occurrence of unisons and octaves.
Harmony: Consonant harmony is often linked to positively-valenced feelings of low to medium arousal, while dissonant harmony expresses negatively-valenced feelings of low to high arousal.
Tonality: Tonal melodies are considered to induce a sense of joy and peacefulness, while angry melodies may be atonal.
Melodies of negative valence have also been found to include chromatic harmony.
Rhythm: Smooth, fluent rhythms convey feelings of positive valence and low arousal, while irregular ones are associated with high-arousal affective states, both positive and negative. Firm rhythm may be linked to expressions of sadness, dignity and vigour.

Pauses/rests: The expression conveyed by pauses depends on the context. Usually, rests following tonal closure are perceived as less tense.
Musical form: High complexity may be associated with negatively-valenced feelings of high or low arousal, if dynamism is high or low, respectively. On the other hand, low complexity combined with low to average dynamism is linked to positive emotions of low to medium arousal. Repetition, condensation, sequential development and pauses may suggest increased tension.

Factors such as loudness, pitch, timbre and articulation, and their role in emotional expression, have been discussed in the previous section in terms of performance expression, and the same applies as far as music composition is concerned. As in performance, emotional expression is a function of more than one of the aforementioned structural factors, working either in an additive way or interacting towards the shaping of the emotional expression. In this context, the emotional expression of the composer should be a core component of the IMI, since it is also a compositional tool aiming at the renewal of related ICH content.

Gesture control of sound

Mapping gesture to sound is the procedure which correlates the gesture input data with the sound control parameters. In order to implement the gesture-sound mapping procedure, we first need to decide which gesture characteristics and sound synthesis variables we are going to use, that is, to answer the question "what to map where", and secondly how this is going to be accomplished. We also need to decide whether we will use explicit or implicit mapping. In explicit mapping, the mathematical relationships between input and output are directly set by the user. On the contrary, indirect mapping generally refers to the use of machine learning techniques, implying a training phase to set parameters that are not directly accessed by the user [1].

Explicit mapping of gestures to sound

Direct or explicit mapping refers to an analytical function that correlates output parameters with input parameters. In its simplest form, called one-to-one, there is a single output parameter per input parameter. Another form, called one-to-many, is divergent in the sense that for one input parameter there are several output parameters. Similarly, the many-to-one mapping uses several inputs for one single output parameter (see the short sketch below).

Implicit mapping of gestures to sound

Indirect or implicit mapping can be seen as a black box between input and output parameters. The desired behaviour of this mapping is specified through machine learning algorithms that require training phases [2], or it is purposely designed as stochastic [3]. For example, an analysis of gesture features based on Hidden Markov Models (HMM) allows estimating the most likely temporal sequence with respect to a template gesture. This can be used to characterize a real-time or live gesture with respect to the template one. Many approaches focus on recognizing gesture units independently of the sound process, but some recent approaches focus on learning the mapping more directly, through the relation between gesture and sound. More specifically, a mapping strategy between gesture data and synthesis model parameters by means of perceptual spaces is proposed by [4]. They define three layers in the mapping chain: from gesture data to a gesture perceptual space, from a sound perceptual space to synthesis model parameters, and between the two perceptual spaces.
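To make the distinction concrete, the following minimal Python sketch illustrates the three explicit mapping forms described above. The input and output parameter names (hand height, wrist rotation, cutoff frequency, etc.) are purely illustrative assumptions and do not correspond to the actual IMI parameters; an implicit mapping would replace such hand-written functions with a model trained on recorded examples.

```python
# Minimal illustration of explicit mapping strategies (hypothetical parameters).

def one_to_one(hand_height):
    """One input parameter drives one synthesis parameter."""
    pitch = 200.0 + 800.0 * hand_height               # hand height (0..1) -> pitch in Hz
    return pitch

def one_to_many(hand_velocity):
    """One input parameter diverges to several synthesis parameters."""
    amplitude = min(1.0, hand_velocity)                # louder for faster gestures
    brightness = 0.3 + 0.7 * min(1.0, hand_velocity)   # brighter timbre as well
    return amplitude, brightness

def many_to_one(wrist_rotation, finger_spread):
    """Several input parameters converge onto a single synthesis parameter."""
    cutoff = 500.0 + 4000.0 * (0.5 * wrist_rotation + 0.5 * finger_spread)
    return cutoff

# In implicit mapping, these analytical functions are replaced by a model
# (e.g. an HMM or a regression) trained on recorded gesture/sound examples.
```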
Another approach for gesture-sound relationship analysis proposes the use of multimodal feature space dimensionality reduction [5]. Also, [6] proposes a divergence measure, which is defined

based on a Hidden Markov Model (HMM) used to model the time profile of sound descriptors [7], and presents a prototypic tool for sound selection driven by users' gestures, based on computing the similarity between the temporal evolution of both gesture and sound parameters. The algorithms that the system contains for time-series multimodal matching are: correlation selection based on Canonical Correlation Analysis (similar to Principal Component Analysis); time-warping selection based on the temporal alignment of both multidimensional signals; and a hybrid strategy which uses both the correlation-based measure and temporal alignment. Moreover, another approach is a hierarchical one [8], which takes into account multilevel time structures in both gesture and sound processes. The authors implement Hierarchical Hidden Markov Models to model gesture input, and temporal mapping strategies based on instantaneous relationships between gesture and sound synthesis parameters. A gesture segmentation method was also presented based on this approach, considering several phases such as preparation, attack, sustain and release. Finally, an approach called mapping by demonstration, allowing users to design the mapping by performing gestures while listening to sound examples, is proposed by [9][10][11]. The system is based on a multimodal HMM that conjointly models the gesture and sound parameters (Figure 3). This approach aims to learn the relationships between movement and sound from examples created by the user. The model is trained by one or multiple gesture performances associated with sound templates. It captures the temporal structure of gesture and sound as well as the variations which occur between multiple performances. During performance, the model is used to predict in real time the sound control parameters associated with a new gesture.

Figure 3. Gesture-sound mapping with the multimodal HMM

Our goal is to develop a methodology which will be based on the above approaches and will use algorithms such as a time-warping algorithm to map gestures to sound. In combination with the development of the Intangible Musical Instrument (IMI), this methodology will provide, through a natural user interface, a natural gestural experience to the user (learner/expert/composer), so as to learn expert musical gestures as well as to perform them and compose with them.
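As a hedged illustration of the time-warping idea mentioned above, the sketch below aligns a live gesture to a recorded template gesture with a basic offline dynamic time warping (DTW) in plain NumPy. It is not the gesture-following implementation used in the IMI (which builds on the HMM-based approaches cited above); the variable names and the random stand-in data are assumptions for illustration only.

```python
import numpy as np

def dtw_path(template, live):
    """Dynamic time warping between a template gesture and a live gesture.
    Both are arrays of shape (time, features). Returns the warping path."""
    n, m = len(template), len(live)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(template[i - 1] - live[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack to recover the alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i = i - 1
        else:
            j = j - 1
    return path[::-1]

# Each live frame is aligned to a template frame; the template index can then be
# used to look up the expert's sound-control parameters at that point in time.
template_gesture = np.random.rand(100, 3)   # stand-in for recorded expert motion features
live_gesture = np.random.rand(80, 3)        # stand-in for the learner's performance
alignment = dtw_path(template_gesture, live_gesture)
```

In a real-time setting, an online variant (e.g. a gesture follower) would update the alignment incrementally as new frames arrive rather than aligning complete recordings.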

Sound synthesis engines

Synthesis implies the artificial construction of a complex body by combining its elements. The major types of sound synthesis engines are described below: additive synthesis, granular synthesis and frequency modulation (FM) synthesis.

Additive Synthesis

Additive synthesis produces a new sound by adding together two or more audio signals. These signals are sinusoids with given amplitudes, frequencies and phases, according to the theoretical principles developed by Fourier [13]:

y(t) = Σ_{k=1}^{N} A_k sin(ω_k t + φ_k)    (1)

The resultant absolute amplitude is the sum of the amplitudes of the individual signals, and the resulting spectrum contains the frequencies of all the individual components. This is potentially a very powerful technique, as it was shown well over 200 years ago by Fourier that any periodic (i.e., pitched) sound can be represented by a sum of simple sine waves [14]. This means that a complex timbre that has been analyzed into its sinusoidal components can then be reconstructed. In practice, however, the approach can be extremely time-consuming.

Granular Synthesis

Granular synthesis is a synthesis method by which sounds are broken into tiny grains, which are then redistributed and re-organized to form other sounds. A grain is a small piece of sonic data, a microsound with a duration between 10 and 50 ms. A grain can be broken into smaller pieces: its envelope and its content. An envelope, in musical sound, is the attack, decay, sustain and release of the sound. Attack is the period of time before the sound reaches its steady-state intensity. Decay refers to a drop in intensity between the attack and the sustain time. Sustain refers to the steady state of the sound at its maximum intensity. Finally, release is the time it takes for the sound to fade to silence [16][17]. The content of the grain can be any sound sample with a complex spectrum. The ADSR envelope model, shown in Figure 4, is used in almost every synthesis engine [18].

Figure 4. A simple envelope

The process of adding a grain envelope is known as windowing. Windowing is a type of amplitude modulation technique. The idea of windowing comes from an analysis process where a small part of a complex sound sample can be put into a window frame, making it easier to analyse the sound and avoiding clipping and clicks. When the grain content is windowed, an amplitude change is imposed over the content [18].

Frequency Modulation (FM) Synthesis

Frequency modulation (FM) is a method that generates sound by combining a waveform with another waveform that modulates it in frequency, and outputs the result as a new waveform [11]. FM synthesis is an alternative to altering the harmonic character of a generated wave through the use of filters; instead, it employs a modulator oscillator that varies the frequency of the sound signal, thus producing new harmonics [14]. FM synthesizers are known to be good at producing metallic sounds.
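For illustration, the following NumPy sketch implements minimal versions of the three techniques just described: an additive tone following Eq. (1), an ADSR envelope and a windowed grain, and a basic FM operator. It is a simplified sketch under assumed parameter values, not the synthesis engines used in the IMI (see Appendix II for related applications).

```python
import numpy as np

SR = 44100  # sample rate in Hz

def additive(freqs, amps, phases, duration):
    """Additive synthesis following Eq. (1): a sum of sinusoidal partials."""
    t = np.arange(int(SR * duration)) / SR
    y = sum(a * np.sin(2 * np.pi * f * t + p) for f, a, p in zip(freqs, amps, phases))
    return y / max(1, len(freqs))               # normalise to avoid clipping

def adsr(n_samples, attack=0.1, decay=0.1, sustain_level=0.7, release=0.2):
    """A simple ADSR envelope; segment lengths are fractions of the total duration."""
    a = int(n_samples * attack); d = int(n_samples * decay); r = int(n_samples * release)
    s = n_samples - a - d - r
    return np.concatenate([
        np.linspace(0.0, 1.0, a),               # attack
        np.linspace(1.0, sustain_level, d),     # decay
        np.full(s, sustain_level),              # sustain
        np.linspace(sustain_level, 0.0, r)])    # release

def grain(source, start, length_ms=30):
    """Granular synthesis element: a short, windowed slice of a source sound."""
    n = int(SR * length_ms / 1000)
    seg = source[start:start + n]
    return seg * np.hanning(len(seg))           # windowing avoids clicks

def fm(carrier, modulator, index, duration):
    """Basic FM synthesis: the modulator varies the carrier's instantaneous frequency."""
    t = np.arange(int(SR * duration)) / SR
    return np.sin(2 * np.pi * carrier * t + index * np.sin(2 * np.pi * modulator * t))

# Example: a three-partial additive tone shaped by an ADSR envelope.
tone = additive([220, 440, 660], [1.0, 0.5, 0.25], [0, 0, 0], duration=1.0)
tone *= adsr(len(tone))
```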

Research approaches on sound synthesis engines

Pablo Arias, during his master's thesis internship at IRCAM, developed a hybrid synthesis engine. This engine was composed of an additive part for the resynthesis of harmonic content and a granular part with onset preservation for noise and transient content. The team created this engine so that it could be controlled by their demonstration system, which allowed them to infer sound parameters in real time. They used technologies such as SDIF, PiPo and MuBu [20]. Another approach is from Rodet and Depalle, also from IRCAM, who studied additive synthesis and developed a new additive synthesis method based on spectral envelopes and the inverse Fast Fourier Transform (FFT-1) [21]. Also, Xavier Hosxe, who works as a software engineer in Paris, created PreenFM, an open-source and open-hardware project that lets you build a hardware FM synth module [22]. For more information on applications of sound synthesis engines, see Appendix II.

Our goal is to develop a methodology which will combine some of the proposed techniques and sound synthesis engines in such a way that the user will be able to change sound parameters in real time, effectively and harmonically.

New interfaces for musical expression

Emerging technologies in the field of musical interfaces and instruments have changed the way modern music is produced. Computer music is nowadays mostly performed with the help of laptops. Extremely static and unexpressive by nature, laptop performances do not substitute for real artistic performances. The lack of performativity, often criticized as "the motionless performer behind the laptop" [55], does not allow for any representation of how the sounds are being created. To allow musical concreteness in computer music, for both the audience and the musicians, performers need some kind of physical intermediary in order to have control over what they want to manipulate. New technologies such as devices, sensors or controllers allow connections between physical movements and parameters of the sound synthesis.

Starting in the 1980s with the rise of computer music and the standardization of the MIDI technology, new musical interfaces have not ceased to evolve. These interfaces, offering live interaction between musicians and computers, appeared as computer processors became efficient enough to process sound synthesis in real time. The research done in this field is multidisciplinary and extremely broad, covering computer science, electronics, signal processing, mechanics, computer vision and even natural sciences such as biology and neurology. All these scientific communities developed singular technologies and systems, motivated by a common desire to create new and exciting ways to control, manipulate and perform sounds and music. Since the early 2000s, the NIME conference, standing for New Interfaces for Musical Expression, has witnessed the drastic evolution of musical interfaces. This conference reaches an interdisciplinary audience of artists, scientists and technologists and keeps track of the latest developments in musical interface design and musical expression. The meeting gathers instrument makers, research labs, computer music institutions and computer music enthusiasts from all over the world.

There are many ways to present the state of the art of new musical interfaces and interactive music systems: chronologically, grouped by scientific branch, grouped by the type of technology that is used, or classified by artistic field (electronic music, performances, installations, etc.).
If we consider as "new" only the inventions of the last two decades, we can narrow the field down to three types of technologies:
- Laptop & MIDI controllers
- Systems reacting to physiological signals
- Systems reacting to gesture

MIDI controllers

The first type of musical interface is the most well-known and widely used. It exists in the form of MIDI keyboards, drum pads, knobs and sliders, touchpads, drawing tablets and even in the form of MIDI instruments: MIDI harp, MIDI saxophone, MIDI violin, etc. In the late 2000s, MIDI controllers like the Monome [58], the Tenori-on by Yamaha [57] or the Novation Launchpad [56] turned into grid-based controllers, where LED buttons trigger sounds or sequences of sounds. With the exception of the Tenori-on, which has an integrated sound synthesis engine, these controllers work along with laptops and need other controllers, such as sliders to control dynamics and knobs for balance or effect levels. Nowadays, modern musicians predominantly work with these MIDI controllers. Almost all contemporary productions using electronics and computers are played or controlled, to a certain extent, with MIDI controllers. Only a tiny sub-group of electronic performers actually use other types of interfaces, and mostly for performance purposes only.

Physiological signals

Fewer musicians have been experimenting with instruments capturing physiological signals. These instruments, originally developed for bio- or neuro-research purposes, are connected to laptops and transduce human electrical activity into sounds. Various biofeedback signals, such as muscle contraction, heartbeat rate, skin conductivity and even brain waves, can be detected and transformed into sound with some signal processing [106]. Due to their technological complexity and sensitivity, they are not easy to set up. Moreover, they are very experimental, and whoever wants to use them had better have some computing skills and some physical training. For example, in the musical piece Suspensions by Atau Tanaka [113], the performer Sarah Nicolls plays the piano with surface electromyography (EMG) electrodes taped to her skin. The EMG records the electrical activity in her muscles and transduces parameters such as speed of contraction and strength to the computer. Atau Tanaka receives these parameters and, through some live electronics, modifies the natural sound of the piano, adding effects matching the theatrical performance given by Sarah Nicolls. The EMG performance is intriguing to watch, since you do not directly see the gesture causing the sound, but rather a gesture caused by an inner contraction of muscles. Interestingly, the EMG can detect information additional to the gestural data, which is not necessarily visible to the naked eye [103][104][105].

Gesture capture

The third type of interface makes use of gestural data from the body (partially or entirely) to generate sounds through live electronics. Interactive systems allowing performing with body gestures appeared in the 1980s. Motion capture sensors allow precise tracking of gestures in 3D and feed the data, in the form of MIDI parameters, to sound synthesizers. The first glove transforming hand gestures into sounds was created for a performance at the Ars Electronica Festival. Ironically called the Lady's Glove [107], it was made of a pair of rubber kitchen gloves with five Hall effect transducers glued to the fingertips and a magnet on the right hand; varying voltages were fed to a Forth board and converted into MIDI signals.
Preceded by the Digital Baton of the MIT Media Lab in 1996 [59], a major breakthrough in musical interfaces came with inertial sensors such as accelerometers and gyroscopes, placed in contact with the body or held in hand (cf. the USB Virtual Maestro using a WiiRemote [60], MO Musical Objects [61]). Dozens of interesting projects for virtual dance and music environments using motion capture have been presented over the last decade of NIME conferences. Finally, gestural data can be obtained with computer vision algorithms. Computer vision is a branch of computer science interested in acquiring, processing, analyzing and understanding multidimensional data from images and videos. Object recognition and video tracking systems are ideal for musical performances, since they allow freedom in body expression and are not intrusive, as opposed to motion capture devices. This has become possible thanks to the latest improvements in 3D cameras, which deliver sufficient precision and decent frame rates.

Moreover, this technology is much cheaper, more accessible and more convenient than inertial suits ($100 for a Kinect vs. $15k-$30k for an inertial mocap suit). Computer vision and music can also meet in the shape of tangible objects. The TUIO objects developed for the Reactable [62], awarded the Prix Ars Electronica (a prestigious prize that "honours creativity and innovativeness in the use of digital media"), can be connected together and trigger sounds through simple object recognition. As a TUIO object is recognized on the table, or on any surface, it immediately sends OSC messages (similar to MIDI messages) [63] to a sound engine. With a more familiar appearance, the airpiano [109] is one of the first digital musical interfaces to introduce an intuitive and simple touch-free interaction. Most touch-free interfaces require users to stare at a 2D display while their hands move in 3D. However, musicians and performers need to be able to play their instruments in a free and intuitive way. The airpiano keys and faders are not on a screen but above the airpiano surface. Thanks to the kinesthetic sense (the awareness of muscular movement and position), the performer knows the exact position of each controller in the air after a training phase. On the surface, a matrix of LEDs displays the keys and faders, which shine when the hand is above them. This feedback is a sufficient marker to learn where to place the hands. One last striking example of computer vision-based musical interaction is the system created at numediart-UMONS [64], using a Leap Motion through PVC material to track hand gestures [65]. The hand skeleton is then used to control sound synthesis parameters using machine-learning algorithms.

Conclusions from the state of the art

Leveraging on the above sections, the conclusions are summarized in the bullets below:

Non-holistic approach to gestural analysis and mapping to sound, without taking into consideration the emotional status
Current acoustic or digital musical instruments take into consideration only effective gestures. Nevertheless, the accompanist and figurative gestures are also very important, since they are related to the expressivity and the social interaction of the performer with the audience. Moreover, the emotional status of the performer is a modality of high importance that is not explicitly taken into consideration in live performances.

Long learning curve to access the musical ICH via tangible musical instruments
«Learning» musical gestures and «performing» music are usually perceived as separate concepts and experiences that pass through intermediate physical mechanisms. Usually, for learners, the «challenges» related to the physical instrument are more important than the «skills» in music playing, thus causing frustration. This means that access to knowledge is a long-term procedure, since there is no quick transition from novice to expert.

Learning the expertise of a musical performer is a second-person experience
The study and interpretation of the musical pieces of composers is a first-person experience. In contrast, learning the gestural expertise of a specific performer is still more or less a black box for the learner, since it can only be approached as a second-person experience. Consequently, only abstract sonic movements that are derived from gestures can be considered as knowledge transmitted to the learner.
Ordinary inherited music scores without any gestural or emotional notation
So far, music notation is created by the composer; it reflects his/her knowledge about the musical piece and it contains, among other things, only very abstract gestural information about energy and expressivity. Moreover, the performer enriches the knowledge extracted from the score with his/her own gestural knowledge when interpreting a musical piece.

Nevertheless, the knowledge of the performer, gestural or emotional, is not documented. In cases where a musical piece doesn't follow the western form of music notation, it is extremely difficult to transmit it to the next generations (Figure 5).

Figure 5. Conclusions from the state of the art

4. Objectives beyond the state of the art

1. Natural-User Interfacing the musical expression based on mapping sound to gestures
A transition from the classical perception of a musical instrument to a prototype Intangible Musical Instrument is created. Moreover, a holistic approach to gesture capturing and mapping to sound is provided, taking into consideration the emotional status of the performer at the synthesis level. The parameters of a signal model are estimated and a new signal is generated on the basis of this model, including finger gestures as well. The sampling parameters of the new signal depend on the emotional status of the performer. Finally, the IMI facilitates the development of musical skills in terms of gestural expression and emotion elicitation.

2. Learning, performing and composing with gestures as a first-person experience
The user is put at the core of musical activities such as performing and composing with gestures, as well as appreciating music in general. Learning expert musical gestures becomes a first-person experience, since the learner has the possibility to gesturally control the sound excerpts of the expert.

3. Augmenting the music score to facilitate the access to musical ICH
The standard music notation is enriched with augmented scoring that embeds gestural and emotional information. The musical knowledge of both composer and performer is provided to the learner through the gestural and emotional annotation of the expert performance, as well as through the inherited music score.

Figure 6. Beyond the state of the art
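As a purely illustrative sketch of the first objective, the Python function below modulates a few synthesis parameters from an affective state expressed in the valence-arousal plane introduced in Section 3.2. The linear rules and parameter names are assumptions made for illustration, loosely following the performance cues reviewed in the state of the art; in the IMI, the actual granularity of these parameters is derived from the emotional normative ratings defined in this task.

```python
def emotion_to_synthesis_params(valence, arousal):
    """Map an affective state in the valence-arousal plane (both in [-1, 1])
    to a few synthesis parameters.

    The linear rules below are illustrative assumptions (e.g. higher arousal
    tends to mean faster tempo, higher sound level and more staccato
    articulation); they are not the IMI's actual mapping.
    """
    return {
        "tempo_factor": 1.0 + 0.4 * arousal,   # faster playback for high arousal
        "sound_level": 0.6 + 0.3 * arousal,    # louder for high arousal
        "brightness": 0.5 + 0.3 * valence,     # brighter timbre for positive valence
        "articulation": 0.5 + 0.3 * arousal,   # more staccato for high-arousal emotions
    }

# Example: a "happy" state (positive valence, moderately high arousal).
print(emotion_to_synthesis_params(valence=0.8, arousal=0.6))
```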

5. Methodological approach

5.1. Methodology of mapping sound to gestures and emotions

The methodology aims at the continuous and real-time gesture control of sound, taking into consideration the emotional status of the performer. Therefore, the gestural knowledge of the performers, which constitutes musical Intangible Cultural Heritage (ICH), lies at the core of our methodology. The proposed methodology has the following main components, which communicate with each other:

Gestures: Expert gestural knowledge is captured based on techniques of feature extraction. Even though our methodology is generic, we focus on the case of the piano. An offline gesture analysis is then conducted, aiming to categorize the expert's pianistic gestures into classes/types according to Delalande's typology [52][53][54].
Sound: In parallel, the real sound produced by the expert pianist is recorded and saved in order to be used by the learner.
Emotion: A morphological analysis has been carried out on the Waldstein and Tempest sonatas of Beethoven, as well as a statistical normative rating that describes the emotional status when an individual listens to these musical excerpts.
Machine learning: Afterwards, multimodal data of the expert, such as gestures (upper part of the body including fingers) and emotions, are used to train the system using machine learning algorithms. Subsequently, the user (learner or performer) tries to perform/imitate the same expert gestures on the Intangible Musical Instrument (IMI), where the recognition of musical gestures and emotions takes place using the system trained by the expert.
Sound synthesis: These multimodal data (gestures and emotions) are mapped to various sound parameters, which results in the re-synthesis of a plausible imitation of the original (expert's) sound in real time.

The IMI is inspired by piano-like gestures: the gestures of the performer resemble those of a piano player and act as metaphors of piano-like performance that are mapped into sounds. Feedback (sonic and/or optical) is provided to the user, depending on his/her performance of the expert gestures. For more details, see Section 6.
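The heterogeneous sensors involved in this methodology (Kinect, Leap Motion, Animazoo inertial sensors, Emotiv EEG) stream their data into a unified interface for fusion (see the multimodal data fusion description and Appendix III). Purely as an illustration of how such streams can be collected over OSC/UDP, the sketch below uses the python-osc library; the library choice, the OSC addresses and the handler names are assumptions and do not describe the project's actual sensor clients.

```python
from pythonosc.dispatcher import Dispatcher
from pythonosc.osc_server import BlockingOSCUDPServer

latest = {}  # most recent frame per modality, to be fused downstream

def on_hand(address, *args):
    latest["leap"] = args        # e.g. fingertip positions from the Leap Motion

def on_skeleton(address, *args):
    latest["kinect"] = args      # e.g. upper-body joint positions from the Kinect

def on_emotion(address, *args):
    latest["eeg"] = args         # e.g. valence/arousal estimates from the EEG headset

dispatcher = Dispatcher()
dispatcher.map("/leap/hand", on_hand)            # hypothetical OSC addresses
dispatcher.map("/kinect/skeleton", on_skeleton)
dispatcher.map("/emotiv/emotion", on_emotion)

server = BlockingOSCUDPServer(("127.0.0.1", 8000), dispatcher)
server.serve_forever()  # each incoming UDP packet updates the fused state
```

In this sketch each modality simply overwrites its latest frame; for actual fusion, timestamp-based synchronisation of the streams (e.g. via NTP, as listed in the abbreviations) would be layered on top.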

Figure 7. Methodology of mapping sound to gestures taking into consideration the emotional status of the performer

Our setup prototype is a construction made of Plexiglas and wood, shaped so as to look like a table on which the user can put his/her hands (Figure 8). The construction is 70 cm long, 40 cm wide and 13 cm high. It lies on a table so that the hands on the Plexiglas are placed at a comfortable height. Two Leap Motions are placed inside the construction and one Kinect is placed above it. The Leap Motions are centred in the two halves of the surface. The Kinect is placed in front of and slightly above the table. Additionally, two inertial sensors are taped to the wrists. Finally, an electroencephalography headset can be mounted on the head to record brain electrical patterns.

Figure 8. Intangible Musical Instrument

As far as the composing-with-gestures objective of the Contemporary Music Composition use case is concerned, the above approach will be extended in the coming months so as to also cover contemporary voice synthesis. We are going to adapt an existing voice processing platform that enables the discrete and continuous control of vocal properties from other modalities (motion, EEG signals, etc.) in real time. The specifications of this new voice processing platform are going to be shaped so as to fit the i-Treasures missions, i.e. finding innovative ways to highlight the cultural heritage. This work will be enabled by appropriately choosing musical gestures according to the targeted cultural practices and using them to control voice generation parameters. Vocal content will aim at referring to cultural heritage, either by using existing ancient texts / song lyrics or by telling the stories of the studied practices. Nevertheless, the voice synthesis functionalities will be approached from a compositional point of view only, because they may not refer to the typical forms of musical ICH. The analysis of audio features is another possibility that may be considered in the future if needed. For the time being, the mapping between gesture and sound, which is developed and described in detail in Section 6.6, associates a sound to a template gesture and links temporal states of the sound with temporal states of the template gesture.

Gesture Vocabulary and hierarchical metaphors on the Intangible Musical Instrument

Based on Delalande's categorisation of musical gestures (cf. Effective, Accompanist, Figurative in Section 3.1.1), we created a vocabulary which organizes the effective gestures hierarchically from basic to complex ones, as seen in musical scores. As shown in Figure 9, primitives are at the base and more complex notations/gestures are added layer by layer. The leading idea is that, in a learning scenario, the learner could go up these layers, starting from the bottom, in order to build up his/her knowledge of piano-like techniques.

Figure 9. Hierarchical representation of the musical vocabulary

We consider that accompanist gestures only become important at the ornaments level. This doesn't mean that the two classes are mutually exclusive, but as accompanist gestures actively involve the whole upper body, they require different sensors to be captured. Typically, basic effective gestures are captured by the Leap Motion and Kinect sensors, accompanist gestures by the Kinect and the Animazoo sensor, and figurative gestures by the Kinect.

Composing with gestures

As already mentioned, the key point for all music forms is to have access to the gestural knowledge of playing a musical instrument and to strengthen the bond between the expert holder of the ICH, who is the composer or performer, and the learner. As a result, the IMI supports not only the learning of expert musical gestures but also performing and composing with gestures. The learning and performing process has been described in Section 5.1, as well as in Sections 6.6 and 7.3. Composing with gestures is a process similar to learning and performing in terms of implementation, but with a different purpose. While in the learning and performing phase the user (learner) is asked to imitate predefined expert gestures taken from the vocabulary and to practise them, in composing with gestures the composer has the ability to add/import new gestures into the vocabulary (by using the learning phase; for more details see Section 7.3.1), to describe the expressiveness (explain the appropriate emotions that the performer should imitate), to map the gestures/emotions into sounds by defining the sonic spaces and parameters, and finally to depict all these in a musical score. As a result, the performer or composer is able to experiment with his/her own gesture-sound mappings and audio synthesis, as well as to compose contemporary music by performing/imitating gestures one after the other, using fingers, body gestures and emotions. Summarizing, for composing with gestures the IMI provides users with the ability:
to import and train the system with any kind of gestures (musical, natural everyday gestures, etc.), as well as

to set any kind of mapping between gestures and sounds, according to their requirements. The goal is to provide a generic system which can be adapted to specific performers and performances.

Preliminary analysis of inherited knowledge from the composer: The paradigm of Beethoven's sonatas as Cultural Heritage

Musicological analysis of "Waldstein" and "The Tempest" sonatas

Many analysts have dealt with the morphological analysis of music and presented their results. One famous analyst, Schenker, argued that any musical form can ultimately be reduced to the tonic triad, a view accepted by the great majority of the music world. Every analyst has a point; what matters is the scope with which we approach the music. In order to arrive at our conclusions, we should explain how the process of listening to and reading music works [111]. Narrative is an instrument of human thought, since the narrative mode is the mode that deals with human intentions and actions. We experience what we hear, the structures of a piece. These structures are necessary to transmit ideas that carry their own internal structure. The study of musical narrativity is therefore necessary, and so is the study of the musical motive. As stated, narrative is a basic category of the human mind. The narrative-generative process unfolds as a gradual expansion, or composing out, of an achronic fundamental structure, which is why it is very important to include it when we need to generate something new [77][112].

In the first movement of the Waldstein Sonata, Beethoven narrates a melody that sticks in the listener's mind because he uses certain repeated motifs, such as repeated notes at a fast tempo, ascending scales that lead to a peak, or cadenzas that drop and end the phrase. It is also vital to focus on the fingering used (though no fingers are used exclusively), so as to realise how the composer achieves his goal of taking the melody where he wants it. To be more specific, in the measures shown in Figure 10, which form a passage of an ascending chromatic scale and a descending arpeggio, alternating fingering motifs are used, and we can see that the point is to reach an upper goal of elevating our sentimental status. Although there are motions going downwards, the aim is to go up (the participation of the left hand with the ascending arpeggio makes this clearer).

Figure 10. Some indicative musical measures (24-27) from the Waldstein Sonata

In subsequent measures, vertical chords and descending movements are present. This is a continuous gesture that can be executed on the piano in a monotonous (in a good sense) way and gives performers the ability to make music by moving the wrist slightly up and down. The performer can also express feelings such as happiness or anger.

Arpeggio gestures give the possibility of moving the hand in a freer, more relaxed way, like dancing gracefully on the piano and, further, on the IMI. The fingering used is almost the same in all arpeggios (1st, 3rd or 4th finger), which is simple enough to be reproduced on the IMI in order to create new music.

Some measures before the ending, a cadenza foreshadows the closing phrase. The use of all fingers and the continuous alternation of the 1st, 2nd and 3rd fingers makes the melody more powerful and gives the impression of something coming to completion. Similarly, performers on the piano may aim for an accomplished target and finish their composition.

In the Tempest Sonata, in measures 2-5 there is a repetition of second intervals using the 1st and 3rd fingers. These intervals, when played continuously, form a kind of obsession which, figuratively, prepares the storm that follows in terms of musical analysis, that is to say the breakout of several melodies. Moreover, this stable motif gives performers the opportunity to become familiar with specific sounds and melody lines, so as to identify them later on. The repetitive movements also give a feeling of intense sentiment and the satisfaction of composing something new in a simple way, since the performers can easily make new music by simply moving the hand on the IMI, taking into account the 2nd interval and the alternation of only two fingers.

Measures 8-17 (see Figure 11) include repetitive intervals, now not only 2nds but also 3rds, 4ths, 5ths and so on. The 5th finger is extremely important here, since it gives the impulse for the descending movement of the hand. The molto energetico marking implies this tendency for powerful fingers and a characteristic downwards and upwards movement of the wrist.

Figure 11. Some indicative musical measures (8-17) from the Tempest Sonata

Other measures depict an ascending chromatic scale where alternating fingering motifs are used, mainly the 1st, 3rd and 2nd fingers. Subsequent measures have a looser, freer style, as the arpeggio implies. When playing arpeggios, the hands seem to dance gracefully on the piano: the wrist turns right in order to go upwards (to the high pitches), whereas it turns left in order to go downwards (to the low pitches). Correspondingly, when we teach performers to do the same on the piano and imitate the ascending/descending arpeggio, the result can be either similar or very different, due to the freedom they have, and new sounds can be created. The innovation of this whole task is that, in a very short time, everybody can play on the IMI by using already existing material (heritage) and compose new things, something extremely difficult to do in the traditional way [78].

Defining emotional normative rating for the sonatas

Emotional states change rapidly during listening; emotions in music are therefore often assessed with dimensional models [92][93], with typical dimensions being valence and arousal level. Valence represents the perception of emotions as being either positive or negative and appears to be related to the presence of consonant or dissonant chords [94]. In contrast, arousal level indicates the degree of intensity of an emotion. Even if it seems to be independent of consonance or dissonance, it is strongly connected with loudness and musical expectation [95]. Violation of an expected chord usually increases emotional arousal, whereas the realization of that expectation lowers emotional arousal [96][97][98][99][100].

The definition of emotional normative ratings for Beethoven's Sonatas «Waldstein» and «The Tempest» is intended to serve two purposes in the context of the Contemporary Music Composition use case:

- Build a database of musical pieces, each labeled with a specific emotional rating in terms of valence and arousal. The emotional classification of these musical measures circumvents the limitation that only a few well-characterized affective auditory stimulus sets are available to researchers, and the fact that most of them are short in duration. These musical excerpts will be used by the EmoGame as auditory stimuli to train the user to elicit specific emotions and reach the required emotional state at will. EmoGame is a game-like Human-Computer Interaction application that aims to help the user learn and handle affective states and transitions towards an augmented artistic performance. EmoGame is described in detail in deliverable D5.2 "First Version of Visualisation for Sensorimotor Learning".
- The identified emotional labels of the musical excerpts will afterwards be transformed into emotion notations describing the affective space of the expert performer while s/he performed the given musical measures. The notations will be provided as input to the "augmented music score", a visualization module which will be integrated into the Music Composition Game in order to facilitate access to the knowledge of the expert, both gestural and emotional.

Procedure

Figure 12. Two-Dimensional Model of Emotion

An interactive online questionnaire has been designed to create the required statistical normative rating that describes the emotional status when an individual listens to specific musical excerpts. The stimuli consisted of musical parts taken from two Sonatas of Beethoven, Waldstein (no. 21) and The Tempest (no. 17). The subjects were 36 volunteers recruited through online calls to participate in our study. The participants rated their emotional responses (valence and arousal) to each musical excerpt using a 9-point Self-Assessment Manikin (SAM) scale, a pictorial rating system developed by Lang in the 1980s [108] to obtain self-assessments of experienced emotions (Figure 13 and Figure 14).

Figure 13. 9-point SAM scale for Valence (1 for displeasure and 9 for pleasure)

Figure 14. 9-point SAM scale for Arousal

The entries submitted by the subjects were statistically analyzed using the AMOS module of the SPSS software. All participants had to answer all of the 22 questions asked. Since there is no objective physical unit of measurement against which self-reported emotional experience can be compared, the evaluation uses the mean and the standard deviation of all subjects' responses as reference. Apart from the first and second statistical moments of the ratings, an effort was made to identify the causality between subsequent ratings. Specifically, since most subjects listened to the stimuli successively, an emotional causality could be anticipated across the experiment, as the previously played musical excerpt could potentially affect the listener's emotional response to the current one, following the composer's intention as expressed through the realised compositional structure. This transition from one musical part to the other might also enhance a learning effect: musical patterns introduced to the subject in one musical part might cause a smaller violation of expectation when they occur in another. According to the bibliography mentioned above, the realization of this expectation could be reflected in a lower arousal rating.

Results

The results of the emotional analysis of Beethoven's Waldstein Sonata are shown in Figure 15 and Figure 16 below.

Musical measures | Emotional state (mean) | Musicological analysis
8-30 | Valence 6.28 / Arousal 5.92 | Elevating our sentimental status a bit (it should have been more, though, since it is the beginning of the piece)
- | Valence 5.14 / Arousal - | Rather neutral status due to the static rhythm
- | Valence 5.61 / Arousal 5.11 | Feeling more relaxed, graceful, dream-like
- | Valence 6.08 / Arousal 5.53 | Repetitive motif, not particularly interesting (it should have been lower in numbers)
- | Valence - / Arousal 5.61 | Arpeggios give a sense of repetition, rather neutral feelings
- | Valence - / Arousal 5.14 | Powerful motif, leading to the end (anxiety should have been caused)

Table 1: Ratings of the Stimuli from Beethoven's Sonata no. 21 (Waldstein)

Figure 15. The corresponding causality diagram

Figure 16. Ratings of Valence (left) and Arousal (right) for Beethoven's Sonata no. 21 (Waldstein)

Beethoven's The Tempest Sonata

The results of the emotional analysis of Beethoven's The Tempest Sonata are shown in Figure 17 and Figure 18 below.

Musical measures | Emotional state (mean) | Musicological analysis
- | Valence 4.64 / Arousal 4.31 | Sweet, rather flat motif; reactions completely justified
- | Valence 4.67 / Arousal - | Loose situation (maybe lower numbers would have been expected)
- | Valence 5.14 / Arousal 4.44 | Static rhythm, rather indifferent motif (reactions completely justified)
- | Valence 4.97 / Arousal 3.50 | Motivation diminishes again due to the neutral motif/rhythm
- | Valence 4.42 / Arousal 2.83 | The Largo and pp reduce the motivation; however, it is a strong passage covering a wide range on the piano

Table 2: Ratings of the Stimuli from Beethoven's Sonata no. 17 (Tempest)

Figure 17. The corresponding causality diagram

Figure 18. Ratings of Valence (left) and Arousal (right) for Beethoven's Sonata no. 17 (Tempest)

Analysis of results

Overall, subjects seemed quite hesitant to use the extreme scores of the 9-point scale (i.e. values 1 and 9). The ratings of arousal and valence were distributed across the rest of the range of scores. However, only the distribution of arousal for the Waldstein sonata fits the normal distribution, and it differs considerably from the corresponding one for The Tempest sonata (Figure 17 and Figure 18 respectively). Since these musical measures were presented to the subjects after a significant number of other auditory stimuli (Waldstein Sonata), there is a slight chance that the low arousal scores were due to the lack of excitement the participants may have felt at the time. It is also remarkable that these musical measures are the only auditory stimuli from the Waldstein Sonata associated with a negative arousal rating (mean = 3.94). As mentioned in the musicological analysis, the continuous gesture of that musical excerpt can be depicted on the IMI in a monotonous way. This monotony could be interpreted as a natural musical expectation, which is linked with a low arousal level, according to the bibliography [96][97][98][99][100].

In terms of the causality relationship between subsequent ratings, the results presented in Table 1 and Table 2 support the hypothesis that the emotional state reached due to the previously played musical measures affects, and serves as an emotional memory for, the emotional response to the current musical excerpt. Specifically, the results presented in Figure 16 and Figure 18 are based on an operational model that hypothesises that a causality relationship exists between subsequent ratings. Each construct, or latent variable, in this operational model is constituted by two observable dimensions: valence and arousal. The operational model has been estimated using the Structural Equation Modelling (SEM) methodology, via the AMOS module of SPSS. All figures attached to the arrows linking two variables in the model refer to standardised coefficients. The levels and the signs of the standardised coefficients indicate the weight and the direction of the causality effect. The fit statistics attached to the results presented in Figure 5.8 are as follows: Chi-square = 112.357; df = 49; p = 0.000; Normed Chi-square = 2.293; GFI = 0.618; NFI = 0.390; CFI = 0.464. Similarly, the fit statistics attached to the results presented in Figure 5.10 are as follows: df = 31; p = 0.000; Normed Chi-square = 2.615; GFI = 0.712; CFI = 0.678. These figures indicate that most fit statistics do not pass the appropriate critical levels. This is due to the fact that the sample sizes are rather small. However, considering that the Normed Chi-square ratios are much less than the critical level of 5, we believe that the results in Figures 1 and 3 may be used for inferences. Thus, the findings from the two causality operational models may be summarised as follows:

There is a causality relationship between successive emotional states, as is seen from the significant standardised coefficients between constructs. For example, in Figure 5.10 (Tempest), we see that in general the influence of emotions decreases as we move from one

measure to the next (successive standardised coefficients of 1, 1, 0.94 and a lower final value). However, this decreasing influence is not seen in Figure 5.8 (Waldstein), because the value of each link between two successive constructs fluctuates in accordance with the emotional state measured.

The weight of each dimension in determining a construct is indicated by the value of its standardised coefficient. For example, in Figure 3 (Tempest), consider the link between two successive constructs: the weights for the first construct are Valence = 0.81 and Arousal = 0.57, while for the second construct the Valence weight is 0.71, with a smaller Arousal weight. These results indicate that valence drives the emotional constructs more than arousal in both cases. Similarly, the influence of each dimension may be traced moving down the causality hierarchy.

All in all, two things should be considered in future work:
1. Increasing the sample size.
2. Decreasing the scale resolution.
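For readers who wish to reproduce the fit check used above, the Normed Chi-square is simply the model chi-square divided by its degrees of freedom, compared against the critical level of 5. The short Python sketch below illustrates the computation with the Waldstein values reported above; it is only an illustration and not part of the AMOS/SPSS workflow actually used.

```python
# A quick check of the Normed Chi-square criterion used above: the ratio of the
# model chi-square to its degrees of freedom, compared against the critical
# level of 5 (values reproduced from the Waldstein model reported in the text).
def normed_chi_square(chi_square, df, critical_level=5.0):
    ratio = chi_square / df
    return ratio, ratio < critical_level

ratio, acceptable = normed_chi_square(112.357, 49)
print(round(ratio, 3), acceptable)   # 2.293 True, matching the reported value
```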

6. Overview of the Intangible Musical Instrument

6.1. Aim of the Intangible Musical Instrument

The Intangible Musical Instrument (IMI) is to be seen in the continuation of the new musical interfaces presented in the state of the art (section 3.3). It is a contribution to the development of gestural interfaces in the realm of computer music. Inspired by the piano, our system uses metaphors of pianistic gestures. In the past decades, MIDI keyboards have become more and more popular because of their price, their smaller size and their flexibility in terms of sound choices. The downsides are that they do not replace real pianos in terms of sound quality and do not fully exploit the potential of digital interfaces. The Intangible Musical Instrument brings the concept of keyboard instruments to the fields of gesture recognition and human-computer interfaces.

The IMI is an intuitive interactive system which enables playing with the fingers and the upper part of the body. On a deeper level, it aims at capturing piano-like gestures in order to create sounds with them. These gestures are finally transformed into sounds via a «mapping» phase. However, we must note that the objective is not a virtual replacement of the piano (nor of any other keyboard instrument), but an adaptation of the existing techniques for this instrument to computer music, including electronic and electroacoustic music. The second objective, made possible by this computer-vision-oriented system, is to create a powerful pedagogical tool where the learner interacts with the system in order to master piano-like techniques. Although the interaction has been simplified for the purpose of having a smooth learning curve, it requires some practice in order to perform elements of musical stylistics such as dynamics and articulation. One last important aspect is the ergonomics of our framework and the relief experienced by resting the wrists and hands on a flat surface. The IMI can be played for a long period of time without the player feeling tired. The height of the table being adjustable, it can be played while standing or sitting. While air-piano gestures (discussed in section 3.3) are often compared with very free movements, close to martial arts movements, the gestures here include the upper part of the body and the fingers. Since it resembles other keyboard instruments, the player can intuitively start playing the Intangible Musical Instrument without much prior knowledge.

6.2. Capturing of the upper body gestures including fingers

6.2.1. Overview of the sensors and structural details

We describe here the successive steps taken in order to first capture the gesture and then to model it. For the capturing part, we use two types of depth camera and two inertial sensors. The first type of depth camera is the Kinect, originally created for video gaming purposes. Equipped with a structured-light projector, it can track the movement of the whole body of individuals in 3D using a Random Decision Forest algorithm [108]. However, we are only interested in the upper part of the body, and the current algorithm delivers a fairly accurate tracking of the head, shoulders, elbows and hands, but not the fingers. The second type of camera used is the Leap Motion, which works with two monochromatic cameras and three infrared LEDs. The Leap Motion provides an accurate description of the hand skeleton, with more than 20 joint positions and velocities, both in 3D (x, y, z coordinates). Leap Motion's tracking algorithm is proprietary and has not been published.
We currently use one Leap Motion but we plan on using two, one for each hand. Each Leap Motion has a field of view of 150° and tracks the hand from below efficiently up to 30 cm above the camera centre (the camera is oriented upwards). Once placed in their slots on the IMI, they cover the whole surface of the table and a volume above it. A plate of 70 by 40 cm made of plexiglass delimits the interaction framework of the IMI. Gestural interaction is not limited to this 2D surface but extends to a volume reaching up to 30 cm above the

plexiglass. This volume, 70 cm long, 40 cm wide and 30 cm high, is the space in which the performer has control, similar to the touch-free experience of the air-piano. However, the constraint imposed by the plexiglass can be seen as a frame of reference for the fingers. When the fingers are in contact with it, a sound is produced, as one would intuitively expect. Also, repeating a gesture in the air relative to the table is easier than in an environment without boundaries or points of reference. Additionally, the plexiglass constitutes a threshold above which the sensors work accurately. In this respect, it is a profitable constraint, since it enables the user to intuitively place his/her hands at the right place and helps in repeating similar gestures.

6.2.2. Descriptors used for gesture capture

One Leap Motion on its own delivers the 3D coordinates of the following joints (Figure 19):

1) palm
2) thumb_metacarpal
3) thumb_proximal
4) thumb_intermediate
5) thumb_dip/distal
6) index_metacarpal
7) index_proximal
8) index_intermediate
9) index_dip/distal
10) middle_metacarpal
11) middle_proximal
12) middle_intermediate
13) middle_dip/distal
14) ring_metacarpal
15) ring_proximal
16) ring_intermediate
17) ring_dip/distal
18) pinky_metacarpal
19) pinky_proximal
20) pinky_intermediate
21) pinky_dip/distal

Figure 19. Leap Motion's hand skeleton model

Additionally, it provides the instantaneous velocities of these joints in the three spatial components. Including the two Leap Motions, we have 21 joints x 3 coordinates x 2 hands for both positions and velocities, that is 252 values in total per time frame. One can use this full set of descriptors in order to capture a gesture precisely. In a performance context, however, we can reduce this data set and keep only the necessary values. In this case, we only keep the positions of the palm and fingertips and only the z-components of their velocities. In total, this performance set sums up to 6 joints x 3 positions x 2 hands plus 6 z-velocities x 2 hands, that is 48 values in total for both hands. To sum up, one can consider the full set of data appropriate for gesture recordings and the reduced set appropriate for performance and training purposes.

The Kinect delivers 9 joints for the upper part of the body:

1. head
2. l_elbow
3. r_elbow
4. neck
5. l_hand
6. r_hand
7. l_shoulder
8. r_shoulder
9. torso

The Animazoo inertial sensors deliver the rotation angles (Euler angles) of the two wrists.
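As an illustration of how the full and reduced descriptor sets described above can be assembled per frame, a minimal Python sketch is given below. It is a sketch under our own naming assumptions; the array layout and function names are not taken from the project software.

```python
# Illustrative sketch only (names are ours): assembling the full and reduced
# descriptor sets described above from the per-frame Leap Motion data of both
# hands (21 joints per hand, 3D positions and 3D velocities).
import numpy as np

N_JOINTS = 21
# 0-based indices into the joint list above: palm plus the five dip/distal tips.
FINGERTIPS = [0, 4, 8, 12, 16, 20]

def full_descriptor(hands):
    """hands: list of two dicts {'pos': (21,3) array, 'vel': (21,3) array}.
    Returns the 252-value vector (21 joints x 3 coords x 2 hands, pos + vel)."""
    return np.concatenate([np.r_[h["pos"].ravel(), h["vel"].ravel()] for h in hands])

def performance_descriptor(hands):
    """Reduced set: palm and fingertip positions plus their z-velocities,
    i.e. 6 joints x 3 coords x 2 hands + 6 z-velocities x 2 hands = 48 values."""
    parts = []
    for h in hands:
        parts.append(h["pos"][FINGERTIPS].ravel())   # 18 position values per hand
        parts.append(h["vel"][FINGERTIPS, 2])        # 6 z-velocity values per hand
    return np.concatenate(parts)

# Example with random stand-in data for the two hands.
hands = [{"pos": np.random.rand(N_JOINTS, 3), "vel": np.random.rand(N_JOINTS, 3)}
         for _ in range(2)]
assert full_descriptor(hands).size == 252
assert performance_descriptor(hands).size == 48
```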

6.3. Multimodal data fusion

The skeleton fusion process intends to fuse the skeletons coming from N sensors and output a single fused skeleton as an OSC stream. The process is meant to be cross-platform and scalable, meaning that it can retrieve data from a variable number of sensors and fuse them, with an automatic calibration process using custom joint rules. The software ITSkeletonFusion has been developed to retrieve any type of skeleton data as long as they are streamed in the correct OSC format. The programs that grab data from a sensor and stream them over OSC are referred to as SensorClients. To allow scalability, each sensor client should output data to a different port. Sensor clients for the Kinect, the IGS Animazoo and the Leap Motion Controller have been developed within the scope of this project (or adapted from existing software). They are also meant to be cross-platform and have been made to rely on the same libraries.

For the skeleton fusion to work, at least one sensor client must be started. Whenever new sensor data arrive, the fusion is applied and the fused skeleton is streamed over OSC. The last data frames of each sensor are used as inputs of the fusion algorithms. To fuse together the skeletons from all sensors, fusion rules are applied that define how joints from sensor client A and sensor client B should be coupled together. Generally speaking, a single rule is used for each fusion (e.g. left hand with left palm), but in some cases several rules may be required (e.g. left hand with left palm, and right hand with right palm).

Whenever new data frames arrive, their current positions are added to a temporal point cloud so as to form a data set that will be used for automatic calibration. Implicitly, we assume that the sensors are not moving. For automatic calibration, rigid registration of the data is achieved using an SVD (Singular Value Decomposition) in order to extract the rotation and translation matrices that transform sensor client data A to sensor client data B. To prevent the registration from converging to a singular configuration, a pre-filter is applied that prevents the point clouds from becoming too dense locally. Additionally, to prevent the temporal point clouds from drifting apart and to take into account that the input data might be noisy, we apply a statistical outlier removal algorithm that removes points Pt which were registered with:

registError(Pt) - avg(errors) > 2 * stddev(errors)    (2)

Implicitly, we assume the registration error to be Gaussian, which seems acceptable. Note also that we do not take the absolute value, thus pushing our data to converge to a small registration error.
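The Python sketch below illustrates the kind of SVD-based rigid registration and one-sided outlier removal (Equation 2) described above. It is an illustrative sketch, not the ITSkeletonFusion implementation; all names and the use of NumPy are our assumptions.

```python
# Illustrative sketch only: rigid registration of two temporal point clouds with
# SVD (Kabsch-style) and the one-sided statistical outlier removal of Eq. (2).
import numpy as np

def rigid_registration(src, dst):
    """Estimate R, t such that R @ src_i + t ~ dst_i (src, dst: Nx3 arrays)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)          # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                     # avoid reflections
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    return R, t

def remove_outliers(src, dst, R, t, k=2.0):
    """Drop correspondences whose signed registration error exceeds
    avg(errors) + k * stddev(errors); no absolute value is taken (Eq. 2)."""
    errors = np.linalg.norm((src @ R.T + t) - dst, axis=1)
    keep = (errors - errors.mean()) <= k * errors.std()
    return src[keep], dst[keep]

# Usage: accumulate corresponding joint positions (e.g. Kinect l_hand vs Leap
# l_palm) over time, estimate the transform, prune outliers, then re-estimate.
src = np.random.rand(200, 3)                     # stand-in for sensor A joints
dst = src + np.array([0.1, 0.0, 0.0])            # stand-in for sensor B joints
R, t = rigid_registration(src, dst)
src_f, dst_f = remove_outliers(src, dst, R, t)
R, t = rigid_registration(src_f, dst_f)
```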

Figure 20. Data fusion

As an example, suppose we run ITSkeletonFusion with a sensor client streaming Kinect data and a sensor client streaming Leap Motion Controller data. We may assume that the joint l_hand (left hand) from the Kinect sensor client should be fused with the joint l_palm (left hand palm) from the Leap Motion Controller sensor client. Fusion rules will automatically be used to calibrate the data from sensor client A to sensor client B. The implication is that fusion rules are not symmetrical. Hence, when fusing sensor client A to sensor client B, the two data sets will be expressed in the frame of reference of sensor B. In other words, the hand from the Leap Motion may be seen as being added to the Kinect skeleton.

6.4. Capturing and recognition of the emotional status

An overview of the capturing and recognition process of the emotional status is given in Figure 21. The EPOC EEG headset by Emotiv (Emotiv Systems, Inc., San Francisco, CA) is used to capture EEG signals [74]. The EPOC is a 14-channel wireless EEG recording headset that streams the acquired electrophysiological data in real time to a paired PC. The communication between the headset and the PC is based on a proprietary wireless protocol that requires an EPOC USB receiver plugged into the computer. The headset is powered by a lithium battery that provides 12 hours of autonomy on a full charge. A detailed description of the technical specifications of the EPOC headset was given in deliverable D3.1 "First Report on ICH Capture and Analysis" [102].

Figure 21. Overview of the EEG capturing and recognition process of the emotional status

The EPOC headset is combined with in-house-developed affective state recognition software. The recognition software gathers the EEG data in the background and computes the current affective state of the user. In particular, the software captures the stream of 14-channel raw EEG data and feeds it to a fractal-dimension, threshold-based recognition algorithm, based on the works of [75][76], that yields the valence (positive or negative) of the user's affective state in real time. Its recognition cycle requires 4 seconds; thus, the user's affective valence is output every 4 seconds. The first version of the recognition algorithm was fully described in deliverable D3.1 "First Report on ICH Capture and Analysis" [102]. In the latest version of the algorithm, excitement levels are also captured as an indicator of the user's arousal (high or low), and in combination with the detected valence a basic, yet sufficient, characterization of his/her affective state based on the valence-arousal model is achieved [69]. Specifically, the algorithm is capable of recognising four general affective states, i.e. positive valence-high arousal, positive valence-low arousal, negative valence-high arousal and negative valence-low arousal.

6.5. Mapping strategies for the Intangible Musical Instrument

In line with previous studies presented in the state of the art (section 3.3.2), we adopted a conceptual approach to mapping that preserves the perceptual relationship between sound and gesture. The gesture perceptual parameters are the three different classes of the musical vocabulary (effective, accompanist and figurative gestures). The proposed methodology is based on the two-level approach of [47][48] and the three-level approach of [4][49], which use perceptual gesture and sound spaces. The methodology is presented below:

Figure 22. Mapping methodology of the IMI

The mapping takes place in two layers and in two levels. The existing approaches take into consideration the explicit role of the mapping strategies [49]. For this reason, we propose a second level, in which implicit mapping strategies take place. More specifically:

- Layer#1 and Level#1: connection of the performer's gesture perceptual parameters to a set of sound perceptual parameters. Perceptual parameters are translated into concepts that can be perceived visually or sonically. This layer can consist of simple one-to-one relationships, or explicit mapping.
- Layer#1 and Level#2: imitative performance with re-interpretation of a sound using a gesture. This type of mapping is called implicit mapping, in which the user can perform a more or less realistic imitative synthesis by synthesising the sound according to the performed gesture.
- Layer#2 (both levels): interpolation of parameter sets, by decoding these parameters, which determines the synthesized sound for continuous gestures.

The need for two types of mapping (explicit and implicit) stems from:
1. the multimodality of the musical performance, including both gestures (upper part of the body including fingers) and emotions;
2. the hierarchical vocabulary, where the primitives are at the base and more complex gestures add up layer by layer.

A proposed, but not binding, way of using them is that explicit mapping can be used for recognising notes (finger motions as in actual piano playing) and implicit mapping for temporal mapping of the upper part of the body. Consequently, the expert can combine these two types of mapping so that the learner has a smoother learning curve. But the composer can also combine these two types of mapping in order to compose contemporary music by evoking imaginative mappings of gestures to sounds.

Explicit mapping based on finger gesture recognition

The explicit mapping (also called direct mapping) creates a correlation between the fingers and the production of the note (Level#1 in Figure 22). As with bijective functions, there is a one-to-one correspondence attributing gesture parameters, such as the 3D positions of finger joints, to the creation of specific notes. This function takes a gesture as input and outputs sounds through a MIDI piano

synthesizer (not to be confused with a piano MIDI keyboard). The subtlety and precision of this function is the core of the direct mapping: the more dynamic and sensitive the system is, the more musical and expressive the instrument is.

Finger and hand heuristics

The plexiglass table fixes a frame of reference for the player's hands and arms. It is placed 13 cm above the two Leap Motions, where the sensors cover the area best. If the plexiglass were lower, we would have slightly better tracking but a smaller field of view. If the plexiglass were higher, the tracking would not be robust enough, even though the range of action would be larger. This trade-off was found experimentally, after hours of practice and multiple different experimental setups. The whole instrument is articulated around this heuristic. It constitutes a threshold for the activation of the sound: when the fingers come into contact with the plexiglass, or hover less than a centimetre above it, a sound is produced. The sound is influenced by various other factors, such as the speed and trajectory of the fingers before contact, but it is only triggered by this contact.

The second heuristic is the division of the table's surface into several zones, represented by the coloured chessboards (see Figure 23). A zone is a region (coloured blue, red or green) corresponding to a set of five notes (e.g. E-F-G-A-B), where each note corresponds to one finger. Within a zone, there is no need to move the hand's position in order to play different notes, as long as the palm is above a red square. Each finger is tracked and has a set of fixed IDs associated with it. Therefore, when the player touches a zone with a finger, he/she will always get the same note. Each hand covers three zones, so one player can cover six zones in total. Since the zone is associated with the hand's centroid (placed above a red square), there is total flexibility and position tolerance in the placement of the fingers. Such a system allows the player not to worry too much about the position of his/her fingertips above the surface of the table, but to focus on other parameters such as the velocity and trajectory prior to contact.

Figure 23. Table's zones

PASR model

Following these heuristics, we successively attempted to model the fingering, the duration and the dynamics. In the pyramid representation, the fingering is represented by a binary function (on/off). However, sound characteristics such as pitch, duration and dynamics all stem from the fingering. This idea led us to decompose the fingering into multiple phases in order to extract information about the trajectory and the temporality of each fingering. This representation has four phases, Rest, Preparation, Attack and Sustain, inspired by the PASR (Preparation, Attack, Sustain, Release) model used in the literature [66]. Using this model, it is very efficient to segment the fingering into four essential phases. As shown in Figure 24, each phase has a distinct feature:

1) Rest: the hand is at rest and the fingertips touch the table
2) Preparation: one or several fingers lift upward
3) Attack: one or several fingers lower down again to the table's level
4) Sustain: the fingertip(s) stay(s) in contact with the table

Figure 24. RPAS model

To sum up, the RPAS model enables us to model the gesture in detail, providing information on the duration, the trajectories and the speed of each phase. This information can be used in real time for the mapping. The preparation time, along with the attack velocity, can be used to express the dynamics of the sound. After some fine-tuning, we modelled a simple logarithmic function (logarithmic velocity = log(Vy)) that transforms the velocity of the attack into dynamics. The logarithmic shape allows the system to be reactive and sensitive to small velocity variations but to reach a ceiling when the attack is strong. The sustain time makes a note last for a determined duration. Finally, the rest position stops the sound. In this way, we can play notes with an intended duration and dynamics. The articulation, or stylistics, which corresponds to the legato/staccato difference, requires information prior to the attack. For this, we use the preparation time along with the attack in order to determine whether the sound is, for instance, legato or staccato.

Implicit mapping based on head, arms and vertebral axis gesture recognition

The second type of mapping, called implicit (Level#2 in Figure 22), is based on a temporal mapping method using a hybrid approach of Hidden Markov Models (HMM) and Dynamic Time Warping (DTW). The basic advantage of this approach is the time warping of the sound that is produced, depending on the speed of the gesture performed in real time. More specifically, it replays sound samples at various speeds according to the gesture performed in real time. Audio time stretching/compression, as well as the re-synthesis of audio, can be performed using a granular sound synthesis engine. In particular, it is based on a method called mapping by demonstration, or temporal mapping [11], which associates a sound with a template gesture and links temporal states of the sound with temporal states of the template gesture. Implicit mapping is based on information from the head, arms and vertebral axis, i.e. the upper body, without including the fingers. This type of mapping includes two phases: a Training Phase and a Recognition Phase.

In the Training Phase of this technique, the expert creates the gesture template by training the system with the expert gestures. The system allows any pre-recorded sound to be chosen and associated with the gestures. In this step, the expert's gesture is perceptually associated with the musical excerpt.

Then, in the Recognition Phase, the learner tries to perform/imitate the gesture of the expert. The basis of imitative synthesis/performance is to make a gesture which is representative enough to re-synthesize a plausible imitation of the original sound.

Figure 25. Training Phase and Recognition Phase of implicit mapping

This means that the closer the gesture is to that of the expert, the closer the sound is to the original playback speed. If the gesture is slower, the sound is played back more slowly, and if the gesture is performed faster, the sound is played back faster. If the gesture is much slower or much faster than the template, one can hear artifacts in the resulting sound due to the limits of the granular sound synthesis engine. This temporal deformation of the resulting sound is the feedback given to the user/learner in order to adjust his/her gestures to the expert's.

Summarizing the mapping methodology of the IMI and the specific types of mapping that are being developed, perceptual relationships between gesture and emotion on the one hand and sound on the other should be considered in the mapping design and thus in the methodology. So, in the table below, we propose some perceptual relationships by extracting medium-level features. This analysis will also help us to detect different playing styles of a similar musical piece. More specifically, the columns present the following:

- 1st column: the type of gesture and emotion;
- 2nd column: the medium-level features which are extracted (more information in D4.1 First Version of Multimodal Analysis, Fusion and Semantic Media Interpretation);
- 3rd column: the values of the features (more information in D4.1 First Version of Multimodal Analysis, Fusion and Semantic Media Interpretation);
- 4th column: the proposed perceptual relationships with sound.

Type of gestures and emotions | Features | Values | Proposed perceptual relationships with sound | Currently extracted
Instantaneous sound producing (effective) | Fingertip velocity | mm/ms | Dynamics: slow/fast attacks, soft/loud | Done
Instantaneous sound producing (effective) | Hand(s) velocity | mm/ms | Additional dynamics: slow/fast attacks, soft/loud | Done

Continuous sound producing (effective) | Inter-onset time interval | Duration in ms for a given onset | Sustain (length of the sound being produced, see ADSR model) | Done
Sound modification (effective) | Position/velocity changes during the inter-onset state | mm and mm/ms | After-touch effects: vibrato, dynamics during the sustain phase | Not yet
Sound modification (effective) | Valence (emotional status) | [1,9] | Timbre (quality of sound; more detail in section 6.7) | Done
Sound modification (effective) | Arousal (emotional status) | [1,9] | Intensity (more detail in section 6.7) | Done
Sound modification (effective) | Octave selection for right hand | [1,3] | Pitch | Done
Sound modification (effective) | Octave selection for left hand | [1,3] | Pitch | Done
Sound facilitating (accompanying) | Entrained | 1 = parallel movements of head, arms and hands, 0 = none | Rhythm/tempo | Not yet
Sound facilitating (accompanying) | Phrasing | 1 = head movement, 0 = none | Helps accentuate the musical passage | Not yet
Sound facilitating (accompanying) | Support | Duration in ms of a constant angle between forearm and upper arm | Change of the sound diffusion pattern of the IMI | Not yet
Communicative (figurative) | Theatrical gestures | Left or right hand distance from the torso on Z greater than a threshold | Energy (motion/quantity of body movements) | Not yet

Table 6.1 Perceptual relationships between gesture and emotion with sound

6.6. Sound synthesis taking into consideration the emotional status

Music is well known for affecting the human emotional status, yet the relationship between specific musical parameters and emotional responses is still not clear. Much research therefore aims to find relationships between the musical output of our system and the psychological data, as discussed in the state of the art. Taking into account the literature (Section 3.2.2), we propose the following relationships between sound parameters and the emotional status dimensions Valence and Arousal (Table 6.3 and Figure 26). They are not yet implemented; their implementation is a future task for the second version of the IMI:

Proposed perceptual relationship with sound | Definition | Associated emotional status
Timbre | The tone color, or the quality, of a sound (i.e. bright, "warm", "dry", "gritty", etc.) | Valence (Emotional status)
Intensity | Intensity (loudness) and attack (rapidity of tone onsets) | Arousal (Emotional status)

Table 6.3 Relationships between sound parameters and the corresponding expressed emotional status

Figure 26. Relationships between sound parameters and the corresponding expressed emotional status
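Since the relationships of Table 6.3 are proposed but not yet implemented, the following Python sketch only illustrates how detected valence and arousal values (on the 9-point scales used above) might be turned into synthesis control parameters. The parameter names and ranges are assumptions of ours, not the project's actual synthesis engine parameters.

```python
# Illustrative sketch only: the Table 6.3 relationships are proposed but not yet
# implemented, so the parameter names (brightness, gain_db, attack_ms) and
# ranges here are assumptions, not the project's actual synthesis parameters.
def emotion_to_sound_params(valence, arousal, v_range=(1, 9), a_range=(1, 9)):
    """Map valence -> timbre (brightness) and arousal -> intensity/attack."""
    def norm(x, lo, hi):
        return min(max((x - lo) / (hi - lo), 0.0), 1.0)

    v = norm(valence, *v_range)   # 0 = negative, 1 = positive valence
    a = norm(arousal, *a_range)   # 0 = low, 1 = high arousal

    return {
        "brightness": 0.2 + 0.8 * v,      # darker timbre for negative valence
        "gain_db": -24.0 + 24.0 * a,      # louder output for high arousal
        "attack_ms": 200.0 - 180.0 * a,   # faster tone onsets for high arousal
    }

# Example: a positive-valence, low-arousal state detected by the EEG module.
print(emotion_to_sound_params(valence=7, arousal=3))
```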

7. Technical implementation and software development

7.1. Prototyping the Intangible Musical Instrument

We defined the heuristics and the ergonomic directions experimentally, after trying many different possibilities. We want the configuration to be adjustable so as to be comfortable for the user while taking the constraints of the sensors into account. Since the IMI is to be used in both performance and e-learning contexts, it should be easily portable, light, solid and foldable. Furthermore, the height of the table should be adjustable according to the player's own height and whether he/she sits on a chair or stands up. Future prototypes will include all these features. The quality of the design and the ergonomics has a strong impact on the player's willingness to play. Figure 27 shows a robust and foldable construction whose position can be adjusted. A space behind the plexiglass is dedicated to the computers and the Kinect.

Figure 27. Intangible Musical Instrument

7.2. Architecture of the Intangible Instrument

The goal of the Intangible Musical Instrument (IMI) is to capture musical expert-like gestures by using multiple sensors based on different technologies (computer vision, inertial sensors) and to extract upper body and hand skeletons from the depth maps in real time. The upper part of the body is captured by the long-range depth camera Kinect XBOX 360 and the IGS Animazoo (using 2 inertial sensors for the 2 wrists). The Leap Motion Controller captures hand and finger movements. This combination of devices is not mandatory but flexible and scalable because, as already mentioned in Section 6.3 (Multimodal data fusion), the system can retrieve data from a variable number of sensors.

The architecture used by the IMI is a client-server architecture. It is composed of N machines, called clients, one per sensor/camera, which receive the data streams locally. There is also a machine, called the server, in which the skeleton fusion and the data processing take place in order to output the visual feedback to the end user. On a technical level, it follows the client-server model which, compared to alternative architectural models, provides better control, a higher level of availability and scalability, as well as improved support for structuring and organising. This technical architecture is illustrated in the figure below:

Figure 28. Architecture of the IMI

For the transmission of data, the Open Sound Control (OSC) protocol is used, which sends data in OSC packets. The machines' synchronization is handled using a Network Time Protocol (NTP) server and clients in order to obtain correct time stamping during the recording sessions. This results in synchronized streaming from the sensors/cameras with the use of timestamps.

7.3. Unified interface for gesture and emotion recognition, mapping and synthesis

The aforementioned methodology, which covers capturing, analysing and recognising data, mapping gesture to sound, and sound synthesis, is implemented in the Max/MSP programming language. Firstly, data capturing takes place using the existing technology (Kinect, Animazoo IGS, Leap Motion Controller). In order for the data to be streamed into Max/MSP in real time, the Open Sound Control (OSC) protocol is used, which sends data in OSC packets to Max/MSP. The udpreceive object is used to receive messages transmitted over a network using the User Datagram Protocol (UDP); it also provides support for third-party Max objects that work with the OSC protocol developed by the Center for New Music and Audio Technologies (CNMAT) at the University of California, Berkeley.

Thereafter, depending on the type of mapping, different data analysis is performed. More specifically, for the explicit mapping, finger data (positions on the x, y, z axes) are used. According to the methodology proposed above, the user/learner can perform/imitate the expert's gestures. The expert's gestures are ascending and descending arpeggios used to play the musical sequence. After the completion of the performed gesture, the user/learner is evaluated. For the second type of mapping (implicit mapping), the data are analysed and used for the machine learning phase. The machine learning phase is based on the Hidden Markov Model (HMM) and Dynamic Time Warping (DTW) techniques [50][51], permitting a time alignment between the model and the data used as input for the recognition. The two phases are:

- Training Phase, in which the expert trains the system with his/her musical gesture; a pre-recorded sound is associated with the template gesture, and temporal states of the sound are linked with temporal states of the template gesture.

- Recognition Phase, in which the user/learner tries to imitate/perform the expert's musical gesture in real time. Real-time performance, and therefore recognition, means that the Gesture Follower does not recognize the gesture once it is completed, but estimates it in real time, moment by moment. As a result, the Gesture Follower is designed to continuously output information about the gesture by providing probabilistic estimations to the user. Simultaneously, the mapping takes place, in which the system predicts the sound according to the performed gesture. The musical gestures are ascending and descending scales, which help the user/learner understand the procedure of moving on the IMI and give him/her a sense of motion.

Lastly, in the evaluation phase, the score is computed from the differences between the expert's and the learner's performances (for more information, please see Appendix I).

7.3.1. Detailed description of the unified interface

The unified interface for gesture and emotion recognition, mapping and sound synthesis is presented in detail below:

Figure 29. Unified interface for gesture and emotion recognition, mapping and synthesis

More specifically:

1. Tick to start the rendering engine
The skeleton fusion has been integrated into the unified interface. The process fuses the skeletons coming from the Leap Motion and the Kinect and outputs a single fused skeleton as an OSC stream. This checkbox controls the 3D engine and the rendering statistics.

2. Select fusion settings (e.g. UpperFusion)
The user has the ability to choose some settings for the fusion. For example, Figure 30 is a screenshot with some of these usual presets, which define which port to listen to, which skeleton profile to use and which camera position to assume.

Figure 30. Fusion settings

Some statistics about the incoming data are also displayed for feedback purposes. The bang flashes if data are received on the currently listened port, and the skeleton data FPS displays the framerate of complete skeleton data (cf. the Troubleshooting section if the bang is flashing but the skeleton data rate is 0 FPS).

Next to the settings shown in Figure 29, there is a black window in which the visualization of the 3D skeleton is displayed (also for feedback purposes).

3. Drag & drop the sound files into each buffer
The user has to drag in sound files that are the result of real pre-recorded expert gestures (in this version, the maximum number of sound files is two). This step is very important for the implicit mapping, because through this process an association between the sound and the template gesture is created, and temporal states of the sound are linked with temporal states of the template gesture.

4. Open the sound
The user has to make sure that the sound is turned on, otherwise s/he will not hear any sound.

5. Train the system with the expert musical gesture (e.g. 1, 2, etc.)
This checkbox controls the Training Phase, i.e. training the system with expert gestures. The expert (or the composer) can create the gesture template by training the system with his/her expert musical gestures. This template can be either a gesture from the vocabulary or a totally new gesture with which the expert (or the composer) wants to train the system. The number of each expert gesture with which the system has been trained is displayed next to the checkbox. The data from the skeleton fusion are also displayed in the window below.

6. Recognize the performed gesture
This part refers to the Recognition Phase, in which the user (learner/performer) performs/imitates the same expert gesture with which the system was trained before. The basis of imitative synthesis/performance is to make a gesture which is representative enough to re-synthesize a plausible imitation of the original sound. The gesture is recognized in real time, meaning that the system estimates the gesture moment by moment over time. As a result, the system continuously outputs information about the gesture in the form of probabilistic estimations (likelihoods). Below the checkbox, the likelihoods refer to two types of output: the Instant Likelihood gives, in real time and moment by moment, the probability of the performed gesture against the pre-recorded ones, while the Average Likelihood gives the similarity of the performed gesture to the pre-recorded ones, i.e. the likeliest gesture. The average likelihoods are computed by averaging the instantaneous likelihoods.

7. See the average likelihoods
The user can see the number of the gesture that is being recognized. There is also a button with some save settings (a screenshot of these settings is shown in Figure 31). Finally, the user can experiment with the tolerance option, which is the expected error (difference) between the performed gesture and the pre-recorded ones.
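To make the relation between the Instant and Average Likelihoods more concrete, the following Python sketch shows one way instantaneous likelihoods per trained template could be accumulated into an average and used to pick the likeliest gesture. It is an illustration only, not the Gesture Follower code used in the interface.

```python
# Illustrative sketch only: how instantaneous likelihoods per template could be
# accumulated into the "Average Likelihood" used to pick the likeliest gesture.
from collections import defaultdict

class LikelihoodTracker:
    def __init__(self):
        self.sums = defaultdict(float)
        self.count = 0

    def update(self, instant_likelihoods):
        """instant_likelihoods: dict {template_id: probability at this frame}."""
        for template_id, p in instant_likelihoods.items():
            self.sums[template_id] += p
        self.count += 1

    def average(self):
        return {t: s / self.count for t, s in self.sums.items()}

    def likeliest(self):
        avg = self.average()
        return max(avg, key=avg.get)

# Example: three frames of instantaneous likelihoods for two trained templates.
tracker = LikelihoodTracker()
for frame in [{1: 0.7, 2: 0.3}, {1: 0.6, 2: 0.4}, {1: 0.8, 2: 0.2}]:
    tracker.update(frame)
print(tracker.average(), tracker.likeliest())   # template 1 is the likeliest
```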

Figure 31. Save settings

8. Play with fingertips
This part refers to the explicit mapping, which creates a correlation between the fingers and the production of the note. The correlation is the following:

Gesture feature | Musical notes
Tip of Thumb | F(53) / C(60) / F(65)
Tip of Index | G(55) / D(62) / G(67)
Tip of Middle | A(57) / E(64) / A(69)
Tip of Ring | B(59) / F(65) / B(71)
Tip of Pinky | C(60) / G(67) / C(72)

Table 7.1 Correlation between the fingers and the production of the note

9. Check the values for valence and arousal
The values (integer numbers) of valence and arousal are displayed while the musical gestures are being performed.

Evaluation part: check your score for the performed gesture
Finally, the user can see his/her evaluation result. The evaluation is different for each type of mapping. In the first part, the user can see the result of his/her expert gesture using implicit mapping, in which the score is computed from the differences between the expert's and the learner's performances. The second part is for explicit mapping, in which the evaluation score of the user/learner is displayed after the completion of the performed gesture. The score is based on the similarity between the musical score's note sequence and the one he/she just played (for more information, please see Appendix I).
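To make the explicit mapping of step 8 concrete, the Python sketch below combines the zones of Figure 23, the notes of Table 7.1 and a logarithmic attack-velocity curve in the spirit of the PASR section. The assumption that the three note alternatives per finger in Table 7.1 correspond to the three zones of one hand, as well as all function names and calibration constants, are ours and not taken from the Max/MSP implementation.

```python
# Illustrative sketch only: a minimal explicit (direct) mapping from fingertip
# contacts to MIDI notes and velocities. The zone interpretation of Table 7.1
# and the calibration constants below are assumptions made for illustration.
import math

FINGERS = ["thumb", "index", "middle", "ring", "pinky"]
NOTE_TABLE = {                      # MIDI notes per zone, from Table 7.1
    1: [53, 55, 57, 59, 60],        # F, G, A, B, C
    2: [60, 62, 64, 65, 67],        # C, D, E, F, G
    3: [65, 67, 69, 71, 72],        # F, G, A, B, C
}

def note_for_contact(zone, finger):
    """Return the MIDI note triggered when `finger` touches the plexiglass
    while the hand's centroid lies in `zone` (1-3)."""
    return NOTE_TABLE[zone][FINGERS.index(finger)]

def midi_velocity(attack_speed_mm_ms, v_ref=0.05, v_max=2.0):
    """Log-shaped dynamics: sensitive to small speeds, saturating for strong
    attacks (v_ref and v_max are illustrative calibration constants)."""
    x = max(attack_speed_mm_ms, v_ref)
    level = math.log(x / v_ref) / math.log(v_max / v_ref)
    return int(round(1 + 126 * min(level, 1.0)))

# Example: the index finger touches zone 2 with a moderate attack speed.
print(note_for_contact(2, "index"), midi_velocity(0.4))
```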

7.4. Game-like application

Finally, a game-like application was designed to map not only the gestures but also the emotions of the learner; an avatar visualizes the expert's gestures inside a 3D environment (Figure 32).

Figure 32. 3D platform with the setup of the Animazoo inertial sensors and Leap Motion sensors

The musical game contains two gesture activities, the final challenge, and EmoActiv, the emotional game. The first two activities also include an observe phase and a practice phase. In the observe phase, the learner can watch a video of an expert performing a musical gesture. At the point when the demos took place, only effective gestures were considered:

- Ascending and descending scales help the performer understand the procedure of moving on the intangible instrument and give him/her a sense of motion. We took into account the wrist movement only, using only the Animazoo inertial sensor.
- Ascending and descending arpeggios involve fingering in order to play the musical sequence. This technique involves a more flexible and precise gesture, since it involves the fingers. The player can also play with the dynamics by fingering softly or hard, using only the Leap Motion sensor.

In the practice mode, the learner imitates the gestures introduced by the virtual expert in the observe mode. In that version of the game, explicit (fingerings to notes, e.g. ascending arpeggio) and implicit (dynamic temporal correspondence and warping between gesture and musical excerpt, e.g. ascending scale) mappings of sound to gestures were supported by the IMI. The final challenge for gestures includes sequential ascending/descending scales and sequential ascending/descending arpeggios as well. Finally, the sonic performance of the expert becomes a first-person experience for the learner, thanks to the fact that his/her gestural performance enables the sonic performance of the expert. The Music Composition game, focusing on visualization, is described in detail in D5.2.
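The evaluation of the explicit mapping compares the musical score's note sequence with the one the learner actually played; the exact scoring method is given in Appendix I and is not reproduced here. The Python sketch below therefore simply assumes an edit-distance-based similarity as one plausible way of turning that comparison into a score.

```python
# Illustrative sketch only: the exact scoring method is given in Appendix I of
# the deliverable; here we assume a simple edit-distance-based similarity
# between the score's note sequence and the notes the learner actually played.
def edit_distance(a, b):
    """Classic Levenshtein distance between two note sequences (MIDI numbers)."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[-1]

def similarity_score(expected, played):
    """Return a 0-100 score: 100 means the sequences are identical."""
    if not expected and not played:
        return 100.0
    d = edit_distance(expected, played)
    return 100.0 * (1.0 - d / max(len(expected), len(played)))

# Example: an ascending arpeggio with one wrong and one missing note.
expected = [53, 57, 60, 65]          # F, A, C, F
played = [53, 57, 62]                # F, A, D
print(similarity_score(expected, played))
```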


More information

Doctor of Philosophy

Doctor of Philosophy University of Adelaide Elder Conservatorium of Music Faculty of Humanities and Social Sciences Declarative Computer Music Programming: using Prolog to generate rule-based musical counterpoints by Robert

More information

A FUNCTIONAL CLASSIFICATION OF ONE INSTRUMENT S TIMBRES

A FUNCTIONAL CLASSIFICATION OF ONE INSTRUMENT S TIMBRES A FUNCTIONAL CLASSIFICATION OF ONE INSTRUMENT S TIMBRES Panayiotis Kokoras School of Music Studies Aristotle University of Thessaloniki email@panayiotiskokoras.com Abstract. This article proposes a theoretical

More information

Elements of Music. How can we tell music from other sounds?

Elements of Music. How can we tell music from other sounds? Elements of Music How can we tell music from other sounds? Sound begins with the vibration of an object. The vibrations are transmitted to our ears by a medium usually air. As a result of the vibrations,

More information

Music Curriculum Glossary

Music Curriculum Glossary Acappella AB form ABA form Accent Accompaniment Analyze Arrangement Articulation Band Bass clef Beat Body percussion Bordun (drone) Brass family Canon Chant Chart Chord Chord progression Coda Color parts

More information

Music, Grade 9, Open (AMU1O)

Music, Grade 9, Open (AMU1O) Music, Grade 9, Open (AMU1O) This course emphasizes the performance of music at a level that strikes a balance between challenge and skill and is aimed at developing technique, sensitivity, and imagination.

More information

THE EFFECT OF EXPERTISE IN EVALUATING EMOTIONS IN MUSIC

THE EFFECT OF EXPERTISE IN EVALUATING EMOTIONS IN MUSIC THE EFFECT OF EXPERTISE IN EVALUATING EMOTIONS IN MUSIC Fabio Morreale, Raul Masu, Antonella De Angeli, Patrizio Fava Department of Information Engineering and Computer Science, University Of Trento, Italy

More information

Computer Coordination With Popular Music: A New Research Agenda 1

Computer Coordination With Popular Music: A New Research Agenda 1 Computer Coordination With Popular Music: A New Research Agenda 1 Roger B. Dannenberg roger.dannenberg@cs.cmu.edu http://www.cs.cmu.edu/~rbd School of Computer Science Carnegie Mellon University Pittsburgh,

More information

Music Theory. Fine Arts Curriculum Framework. Revised 2008

Music Theory. Fine Arts Curriculum Framework. Revised 2008 Music Theory Fine Arts Curriculum Framework Revised 2008 Course Title: Music Theory Course/Unit Credit: 1 Course Number: Teacher Licensure: Grades: 9-12 Music Theory Music Theory is a two-semester course

More information

Laboratory Assignment 3. Digital Music Synthesis: Beethoven s Fifth Symphony Using MATLAB

Laboratory Assignment 3. Digital Music Synthesis: Beethoven s Fifth Symphony Using MATLAB Laboratory Assignment 3 Digital Music Synthesis: Beethoven s Fifth Symphony Using MATLAB PURPOSE In this laboratory assignment, you will use MATLAB to synthesize the audio tones that make up a well-known

More information

K-12 Performing Arts - Music Standards Lincoln Community School Sources: ArtsEdge - National Standards for Arts Education

K-12 Performing Arts - Music Standards Lincoln Community School Sources: ArtsEdge - National Standards for Arts Education K-12 Performing Arts - Music Standards Lincoln Community School Sources: ArtsEdge - National Standards for Arts Education Grades K-4 Students sing independently, on pitch and in rhythm, with appropriate

More information

THE INTERACTION BETWEEN MELODIC PITCH CONTENT AND RHYTHMIC PERCEPTION. Gideon Broshy, Leah Latterner and Kevin Sherwin

THE INTERACTION BETWEEN MELODIC PITCH CONTENT AND RHYTHMIC PERCEPTION. Gideon Broshy, Leah Latterner and Kevin Sherwin THE INTERACTION BETWEEN MELODIC PITCH CONTENT AND RHYTHMIC PERCEPTION. BACKGROUND AND AIMS [Leah Latterner]. Introduction Gideon Broshy, Leah Latterner and Kevin Sherwin Yale University, Cognition of Musical

More information

Modeling expressiveness in music performance

Modeling expressiveness in music performance Chapter 3 Modeling expressiveness in music performance version 2004 3.1 The quest for expressiveness During the last decade, lot of research effort has been spent to connect two worlds that seemed to be

More information

2 2. Melody description The MPEG-7 standard distinguishes three types of attributes related to melody: the fundamental frequency LLD associated to a t

2 2. Melody description The MPEG-7 standard distinguishes three types of attributes related to melody: the fundamental frequency LLD associated to a t MPEG-7 FOR CONTENT-BASED MUSIC PROCESSING Λ Emilia GÓMEZ, Fabien GOUYON, Perfecto HERRERA and Xavier AMATRIAIN Music Technology Group, Universitat Pompeu Fabra, Barcelona, SPAIN http://www.iua.upf.es/mtg

More information

An Interactive Case-Based Reasoning Approach for Generating Expressive Music

An Interactive Case-Based Reasoning Approach for Generating Expressive Music Applied Intelligence 14, 115 129, 2001 c 2001 Kluwer Academic Publishers. Manufactured in The Netherlands. An Interactive Case-Based Reasoning Approach for Generating Expressive Music JOSEP LLUÍS ARCOS

More information

2. AN INTROSPECTION OF THE MORPHING PROCESS

2. AN INTROSPECTION OF THE MORPHING PROCESS 1. INTRODUCTION Voice morphing means the transition of one speech signal into another. Like image morphing, speech morphing aims to preserve the shared characteristics of the starting and final signals,

More information

Arts, Computers and Artificial Intelligence

Arts, Computers and Artificial Intelligence Arts, Computers and Artificial Intelligence Sol Neeman School of Technology Johnson and Wales University Providence, RI 02903 Abstract Science and art seem to belong to different cultures. Science and

More information

A MULTI-PARAMETRIC AND REDUNDANCY-FILTERING APPROACH TO PATTERN IDENTIFICATION

A MULTI-PARAMETRIC AND REDUNDANCY-FILTERING APPROACH TO PATTERN IDENTIFICATION A MULTI-PARAMETRIC AND REDUNDANCY-FILTERING APPROACH TO PATTERN IDENTIFICATION Olivier Lartillot University of Jyväskylä Department of Music PL 35(A) 40014 University of Jyväskylä, Finland ABSTRACT This

More information

Augmentation Matrix: A Music System Derived from the Proportions of the Harmonic Series

Augmentation Matrix: A Music System Derived from the Proportions of the Harmonic Series -1- Augmentation Matrix: A Music System Derived from the Proportions of the Harmonic Series JERICA OBLAK, Ph. D. Composer/Music Theorist 1382 1 st Ave. New York, NY 10021 USA Abstract: - The proportional

More information

Assessment may include recording to be evaluated by students, teachers, and/or administrators in addition to live performance evaluation.

Assessment may include recording to be evaluated by students, teachers, and/or administrators in addition to live performance evaluation. Title of Unit: Choral Concert Performance Preparation Repertoire: Simple Gifts (Shaker Song). Adapted by Aaron Copland, Transcribed for Chorus by Irving Fine. Boosey & Hawkes, 1952. Level: NYSSMA Level

More information

Computational Parsing of Melody (CPM): Interface Enhancing the Creative Process during the Production of Music

Computational Parsing of Melody (CPM): Interface Enhancing the Creative Process during the Production of Music Computational Parsing of Melody (CPM): Interface Enhancing the Creative Process during the Production of Music Andrew Blake and Cathy Grundy University of Westminster Cavendish School of Computer Science

More information

An interdisciplinary approach to audio effect classification

An interdisciplinary approach to audio effect classification An interdisciplinary approach to audio effect classification Vincent Verfaille, Catherine Guastavino Caroline Traube, SPCL / CIRMMT, McGill University GSLIS / CIRMMT, McGill University LIAM / OICM, Université

More information

Automatic Construction of Synthetic Musical Instruments and Performers

Automatic Construction of Synthetic Musical Instruments and Performers Ph.D. Thesis Proposal Automatic Construction of Synthetic Musical Instruments and Performers Ning Hu Carnegie Mellon University Thesis Committee Roger B. Dannenberg, Chair Michael S. Lewicki Richard M.

More information

Audio Feature Extraction for Corpus Analysis

Audio Feature Extraction for Corpus Analysis Audio Feature Extraction for Corpus Analysis Anja Volk Sound and Music Technology 5 Dec 2017 1 Corpus analysis What is corpus analysis study a large corpus of music for gaining insights on general trends

More information

Music for Alto Saxophone & Computer

Music for Alto Saxophone & Computer Music for Alto Saxophone & Computer by Cort Lippe 1997 for Stephen Duke 1997 Cort Lippe All International Rights Reserved Performance Notes There are four classes of multiphonics in section III. The performer

More information

Musical Acoustics Lecture 15 Pitch & Frequency (Psycho-Acoustics)

Musical Acoustics Lecture 15 Pitch & Frequency (Psycho-Acoustics) 1 Musical Acoustics Lecture 15 Pitch & Frequency (Psycho-Acoustics) Pitch Pitch is a subjective characteristic of sound Some listeners even assign pitch differently depending upon whether the sound was

More information

POST-PROCESSING FIDDLE : A REAL-TIME MULTI-PITCH TRACKING TECHNIQUE USING HARMONIC PARTIAL SUBTRACTION FOR USE WITHIN LIVE PERFORMANCE SYSTEMS

POST-PROCESSING FIDDLE : A REAL-TIME MULTI-PITCH TRACKING TECHNIQUE USING HARMONIC PARTIAL SUBTRACTION FOR USE WITHIN LIVE PERFORMANCE SYSTEMS POST-PROCESSING FIDDLE : A REAL-TIME MULTI-PITCH TRACKING TECHNIQUE USING HARMONIC PARTIAL SUBTRACTION FOR USE WITHIN LIVE PERFORMANCE SYSTEMS Andrew N. Robertson, Mark D. Plumbley Centre for Digital Music

More information

Smooth Rhythms as Probes of Entrainment. Music Perception 10 (1993): ABSTRACT

Smooth Rhythms as Probes of Entrainment. Music Perception 10 (1993): ABSTRACT Smooth Rhythms as Probes of Entrainment Music Perception 10 (1993): 503-508 ABSTRACT If one hypothesizes rhythmic perception as a process employing oscillatory circuits in the brain that entrain to low-frequency

More information

Music Radar: A Web-based Query by Humming System

Music Radar: A Web-based Query by Humming System Music Radar: A Web-based Query by Humming System Lianjie Cao, Peng Hao, Chunmeng Zhou Computer Science Department, Purdue University, 305 N. University Street West Lafayette, IN 47907-2107 {cao62, pengh,

More information

Experiments on musical instrument separation using multiplecause

Experiments on musical instrument separation using multiplecause Experiments on musical instrument separation using multiplecause models J Klingseisen and M D Plumbley* Department of Electronic Engineering King's College London * - Corresponding Author - mark.plumbley@kcl.ac.uk

More information

Music Performance Solo

Music Performance Solo Music Performance Solo 2019 Subject Outline Stage 2 This Board-accredited Stage 2 subject outline will be taught from 2019 Published by the SACE Board of South Australia, 60 Greenhill Road, Wayville, South

More information

Analysis of local and global timing and pitch change in ordinary

Analysis of local and global timing and pitch change in ordinary Alma Mater Studiorum University of Bologna, August -6 6 Analysis of local and global timing and pitch change in ordinary melodies Roger Watt Dept. of Psychology, University of Stirling, Scotland r.j.watt@stirling.ac.uk

More information

Tempo and Beat Analysis

Tempo and Beat Analysis Advanced Course Computer Science Music Processing Summer Term 2010 Meinard Müller, Peter Grosche Saarland University and MPI Informatik meinard@mpi-inf.mpg.de Tempo and Beat Analysis Musical Properties:

More information

Music Performance Ensemble

Music Performance Ensemble Music Performance Ensemble 2019 Subject Outline Stage 2 This Board-accredited Stage 2 subject outline will be taught from 2019 Published by the SACE Board of South Australia, 60 Greenhill Road, Wayville,

More information

LESSON 1 PITCH NOTATION AND INTERVALS

LESSON 1 PITCH NOTATION AND INTERVALS FUNDAMENTALS I 1 Fundamentals I UNIT-I LESSON 1 PITCH NOTATION AND INTERVALS Sounds that we perceive as being musical have four basic elements; pitch, loudness, timbre, and duration. Pitch is the relative

More information

Implementation of an 8-Channel Real-Time Spontaneous-Input Time Expander/Compressor

Implementation of an 8-Channel Real-Time Spontaneous-Input Time Expander/Compressor Implementation of an 8-Channel Real-Time Spontaneous-Input Time Expander/Compressor Introduction: The ability to time stretch and compress acoustical sounds without effecting their pitch has been an attractive

More information

HST 725 Music Perception & Cognition Assignment #1 =================================================================

HST 725 Music Perception & Cognition Assignment #1 ================================================================= HST.725 Music Perception and Cognition, Spring 2009 Harvard-MIT Division of Health Sciences and Technology Course Director: Dr. Peter Cariani HST 725 Music Perception & Cognition Assignment #1 =================================================================

More information

CSC475 Music Information Retrieval

CSC475 Music Information Retrieval CSC475 Music Information Retrieval Symbolic Music Representations George Tzanetakis University of Victoria 2014 G. Tzanetakis 1 / 30 Table of Contents I 1 Western Common Music Notation 2 Digital Formats

More information

The Tone Height of Multiharmonic Sounds. Introduction

The Tone Height of Multiharmonic Sounds. Introduction Music-Perception Winter 1990, Vol. 8, No. 2, 203-214 I990 BY THE REGENTS OF THE UNIVERSITY OF CALIFORNIA The Tone Height of Multiharmonic Sounds ROY D. PATTERSON MRC Applied Psychology Unit, Cambridge,

More information

Real-time Granular Sampling Using the IRCAM Signal Processing Workstation. Cort Lippe IRCAM, 31 rue St-Merri, Paris, 75004, France

Real-time Granular Sampling Using the IRCAM Signal Processing Workstation. Cort Lippe IRCAM, 31 rue St-Merri, Paris, 75004, France Cort Lippe 1 Real-time Granular Sampling Using the IRCAM Signal Processing Workstation Cort Lippe IRCAM, 31 rue St-Merri, Paris, 75004, France Running Title: Real-time Granular Sampling [This copy of this

More information

Semi-automated extraction of expressive performance information from acoustic recordings of piano music. Andrew Earis

Semi-automated extraction of expressive performance information from acoustic recordings of piano music. Andrew Earis Semi-automated extraction of expressive performance information from acoustic recordings of piano music Andrew Earis Outline Parameters of expressive piano performance Scientific techniques: Fourier transform

More information

BRAY, KENNETH and PAUL GREEN (arrangers) UN CANADIEN ERRANT Musical Features of the Repertoire Technical Challenges of the Clarinet Part

BRAY, KENNETH and PAUL GREEN (arrangers) UN CANADIEN ERRANT Musical Features of the Repertoire Technical Challenges of the Clarinet Part UN CANADIEN ERRANT Musical Source: A French Canadian folk song, associated with rebellions of Upper and Lower Canada, 1837 (See McGee, Timothy J. The Music of Canada. New York: W.W. Norton & Co., 1985.

More information

The Human Features of Music.

The Human Features of Music. The Human Features of Music. Bachelor Thesis Artificial Intelligence, Social Studies, Radboud University Nijmegen Chris Kemper, s4359410 Supervisor: Makiko Sadakata Artificial Intelligence, Social Studies,

More information

Pitch Perception and Grouping. HST.723 Neural Coding and Perception of Sound

Pitch Perception and Grouping. HST.723 Neural Coding and Perception of Sound Pitch Perception and Grouping HST.723 Neural Coding and Perception of Sound Pitch Perception. I. Pure Tones The pitch of a pure tone is strongly related to the tone s frequency, although there are small

More information

On time: the influence of tempo, structure and style on the timing of grace notes in skilled musical performance

On time: the influence of tempo, structure and style on the timing of grace notes in skilled musical performance RHYTHM IN MUSIC PERFORMANCE AND PERCEIVED STRUCTURE 1 On time: the influence of tempo, structure and style on the timing of grace notes in skilled musical performance W. Luke Windsor, Rinus Aarts, Peter

More information

PLOrk Beat Science 2.0 NIME 2009 club submission by Ge Wang and Rebecca Fiebrink

PLOrk Beat Science 2.0 NIME 2009 club submission by Ge Wang and Rebecca Fiebrink PLOrk Beat Science 2.0 NIME 2009 club submission by Ge Wang and Rebecca Fiebrink Introduction This document details our proposed NIME 2009 club performance of PLOrk Beat Science 2.0, our multi-laptop,

More information

UNIVERSITY OF DUBLIN TRINITY COLLEGE

UNIVERSITY OF DUBLIN TRINITY COLLEGE UNIVERSITY OF DUBLIN TRINITY COLLEGE FACULTY OF ENGINEERING & SYSTEMS SCIENCES School of Engineering and SCHOOL OF MUSIC Postgraduate Diploma in Music and Media Technologies Hilary Term 31 st January 2005

More information

Sound visualization through a swarm of fireflies

Sound visualization through a swarm of fireflies Sound visualization through a swarm of fireflies Ana Rodrigues, Penousal Machado, Pedro Martins, and Amílcar Cardoso CISUC, Deparment of Informatics Engineering, University of Coimbra, Coimbra, Portugal

More information

LOUDNESS EFFECT OF THE DIFFERENT TONES ON THE TIMBRE SUBJECTIVE PERCEPTION EXPERIMENT OF ERHU

LOUDNESS EFFECT OF THE DIFFERENT TONES ON THE TIMBRE SUBJECTIVE PERCEPTION EXPERIMENT OF ERHU The 21 st International Congress on Sound and Vibration 13-17 July, 2014, Beijing/China LOUDNESS EFFECT OF THE DIFFERENT TONES ON THE TIMBRE SUBJECTIVE PERCEPTION EXPERIMENT OF ERHU Siyu Zhu, Peifeng Ji,

More information

XYNTHESIZR User Guide 1.5

XYNTHESIZR User Guide 1.5 XYNTHESIZR User Guide 1.5 Overview Main Screen Sequencer Grid Bottom Panel Control Panel Synth Panel OSC1 & OSC2 Amp Envelope LFO1 & LFO2 Filter Filter Envelope Reverb Pan Delay SEQ Panel Sequencer Key

More information

Digital audio and computer music. COS 116, Spring 2012 Guest lecture: Rebecca Fiebrink

Digital audio and computer music. COS 116, Spring 2012 Guest lecture: Rebecca Fiebrink Digital audio and computer music COS 116, Spring 2012 Guest lecture: Rebecca Fiebrink Overview 1. Physics & perception of sound & music 2. Representations of music 3. Analyzing music with computers 4.

More information

PHYSICS OF MUSIC. 1.) Charles Taylor, Exploring Music (Music Library ML3805 T )

PHYSICS OF MUSIC. 1.) Charles Taylor, Exploring Music (Music Library ML3805 T ) REFERENCES: 1.) Charles Taylor, Exploring Music (Music Library ML3805 T225 1992) 2.) Juan Roederer, Physics and Psychophysics of Music (Music Library ML3805 R74 1995) 3.) Physics of Sound, writeup in this

More information

Edit Menu. To Change a Parameter Place the cursor below the parameter field. Rotate the Data Entry Control to change the parameter value.

Edit Menu. To Change a Parameter Place the cursor below the parameter field. Rotate the Data Entry Control to change the parameter value. The Edit Menu contains four layers of preset parameters that you can modify and then save as preset information in one of the user preset locations. There are four instrument layers in the Edit menu. See

More information

River Dell Regional School District. Visual and Performing Arts Curriculum Music

River Dell Regional School District. Visual and Performing Arts Curriculum Music Visual and Performing Arts Curriculum Music 2015 Grades 7-12 Mr. Patrick Fletcher Superintendent River Dell Regional Schools Ms. Lorraine Brooks Principal River Dell High School Mr. Richard Freedman Principal

More information

Elements of Music - 2

Elements of Music - 2 Elements of Music - 2 A series of single tones that add up to a recognizable whole. - Steps small intervals - Leaps Larger intervals The specific order of steps and leaps, short notes and long notes, is

More information

Piano Teacher Program

Piano Teacher Program Piano Teacher Program Associate Teacher Diploma - B.C.M.A. The Associate Teacher Diploma is open to candidates who have attained the age of 17 by the date of their final part of their B.C.M.A. examination.

More information

HOW TO STUDY: YEAR 11 MUSIC 1

HOW TO STUDY: YEAR 11 MUSIC 1 HOW TO STUDY: YEAR 11 MUSIC 1 AURAL EXAM EXAMINATION STRUCTURE Length of the exam: 1 hour and 10 minutes You have 5 minutes of reading time before the examination starts you are NOT allowed to do any writing

More information

CURRICULUM FOR INTRODUCTORY PIANO LAB GRADES 9-12

CURRICULUM FOR INTRODUCTORY PIANO LAB GRADES 9-12 CURRICULUM FOR INTRODUCTORY PIANO LAB GRADES 9-12 This curriculum is part of the Educational Program of Studies of the Rahway Public Schools. ACKNOWLEDGMENTS Frank G. Mauriello, Interim Assistant Superintendent

More information

y POWER USER MUSIC PRODUCTION and PERFORMANCE With the MOTIF ES Mastering the Sample SLICE function

y POWER USER MUSIC PRODUCTION and PERFORMANCE With the MOTIF ES Mastering the Sample SLICE function y POWER USER MUSIC PRODUCTION and PERFORMANCE With the MOTIF ES Mastering the Sample SLICE function Phil Clendeninn Senior Product Specialist Technology Products Yamaha Corporation of America Working with

More information

Embodied music cognition and mediation technology

Embodied music cognition and mediation technology Embodied music cognition and mediation technology Briefly, what it is all about: Embodied music cognition = Experiencing music in relation to our bodies, specifically in relation to body movements, both

More information

Affective Sound Synthesis: Considerations in Designing Emotionally Engaging Timbres for Computer Music

Affective Sound Synthesis: Considerations in Designing Emotionally Engaging Timbres for Computer Music Affective Sound Synthesis: Considerations in Designing Emotionally Engaging Timbres for Computer Music Aura Pon (a), Dr. David Eagle (b), and Dr. Ehud Sharlin (c) (a) Interactions Laboratory, University

More information

PRESCOTT UNIFIED SCHOOL DISTRICT District Instructional Guide January 2016

PRESCOTT UNIFIED SCHOOL DISTRICT District Instructional Guide January 2016 Grade Level: 9 12 Subject: Jazz Ensemble Time: School Year as listed Core Text: Time Unit/Topic Standards Assessments 1st Quarter Arrange a melody Creating #2A Select and develop arrangements, sections,

More information

Enhancing Music Maps

Enhancing Music Maps Enhancing Music Maps Jakob Frank Vienna University of Technology, Vienna, Austria http://www.ifs.tuwien.ac.at/mir frank@ifs.tuwien.ac.at Abstract. Private as well as commercial music collections keep growing

More information

A Case Based Approach to the Generation of Musical Expression

A Case Based Approach to the Generation of Musical Expression A Case Based Approach to the Generation of Musical Expression Taizan Suzuki Takenobu Tokunaga Hozumi Tanaka Department of Computer Science Tokyo Institute of Technology 2-12-1, Oookayama, Meguro, Tokyo

More information

Sofia Dahl Cognitive and Systematic Musicology Lab, School of Music. Looking at movement gesture Examples from drumming and percussion Sofia Dahl

Sofia Dahl Cognitive and Systematic Musicology Lab, School of Music. Looking at movement gesture Examples from drumming and percussion Sofia Dahl Looking at movement gesture Examples from drumming and percussion Sofia Dahl Players movement gestures communicative sound facilitating visual gesture sound producing sound accompanying gesture sound gesture

More information

Robert Alexandru Dobre, Cristian Negrescu

Robert Alexandru Dobre, Cristian Negrescu ECAI 2016 - International Conference 8th Edition Electronics, Computers and Artificial Intelligence 30 June -02 July, 2016, Ploiesti, ROMÂNIA Automatic Music Transcription Software Based on Constant Q

More information

Musicians Adjustment of Performance to Room Acoustics, Part III: Understanding the Variations in Musical Expressions

Musicians Adjustment of Performance to Room Acoustics, Part III: Understanding the Variations in Musical Expressions Musicians Adjustment of Performance to Room Acoustics, Part III: Understanding the Variations in Musical Expressions K. Kato a, K. Ueno b and K. Kawai c a Center for Advanced Science and Innovation, Osaka

More information

From quantitative empirï to musical performology: Experience in performance measurements and analyses

From quantitative empirï to musical performology: Experience in performance measurements and analyses International Symposium on Performance Science ISBN 978-90-9022484-8 The Author 2007, Published by the AEC All rights reserved From quantitative empirï to musical performology: Experience in performance

More information

MUSIC THEORY CURRICULUM STANDARDS GRADES Students will sing, alone and with others, a varied repertoire of music.

MUSIC THEORY CURRICULUM STANDARDS GRADES Students will sing, alone and with others, a varied repertoire of music. MUSIC THEORY CURRICULUM STANDARDS GRADES 9-12 Content Standard 1.0 Singing Students will sing, alone and with others, a varied repertoire of music. The student will 1.1 Sing simple tonal melodies representing

More information

Interacting with a Virtual Conductor

Interacting with a Virtual Conductor Interacting with a Virtual Conductor Pieter Bos, Dennis Reidsma, Zsófia Ruttkay, Anton Nijholt HMI, Dept. of CS, University of Twente, PO Box 217, 7500AE Enschede, The Netherlands anijholt@ewi.utwente.nl

More information

The purpose of this essay is to impart a basic vocabulary that you and your fellow

The purpose of this essay is to impart a basic vocabulary that you and your fellow Music Fundamentals By Benjamin DuPriest The purpose of this essay is to impart a basic vocabulary that you and your fellow students can draw on when discussing the sonic qualities of music. Excursions

More information

MUSIC PROGRESSIONS. Curriculum Guide

MUSIC PROGRESSIONS. Curriculum Guide MUSIC PROGRESSIONS A Comprehensive Musicianship Program Curriculum Guide Fifth edition 2006 2009 Corrections Kansas Music Teachers Association Kansas Music Teachers Association s MUSIC PROGRESSIONS A Comprehensive

More information

Vuzik: Music Visualization and Creation on an Interactive Surface

Vuzik: Music Visualization and Creation on an Interactive Surface Vuzik: Music Visualization and Creation on an Interactive Surface Aura Pon aapon@ucalgary.ca Junko Ichino Graduate School of Information Systems University of Electrocommunications Tokyo, Japan ichino@is.uec.ac.jp

More information

Getting Started with the LabVIEW Sound and Vibration Toolkit

Getting Started with the LabVIEW Sound and Vibration Toolkit 1 Getting Started with the LabVIEW Sound and Vibration Toolkit This tutorial is designed to introduce you to some of the sound and vibration analysis capabilities in the industry-leading software tool

More information

Figure 1: Feature Vector Sequence Generator block diagram.

Figure 1: Feature Vector Sequence Generator block diagram. 1 Introduction Figure 1: Feature Vector Sequence Generator block diagram. We propose designing a simple isolated word speech recognition system in Verilog. Our design is naturally divided into two modules.

More information

2014 Music Performance GA 3: Aural and written examination

2014 Music Performance GA 3: Aural and written examination 2014 Music Performance GA 3: Aural and written examination GENERAL COMMENTS The format of the 2014 Music Performance examination was consistent with examination specifications and sample material on the

More information

Music Complexity Descriptors. Matt Stabile June 6 th, 2008

Music Complexity Descriptors. Matt Stabile June 6 th, 2008 Music Complexity Descriptors Matt Stabile June 6 th, 2008 Musical Complexity as a Semantic Descriptor Modern digital audio collections need new criteria for categorization and searching. Applicable to:

More information

Acoustic Instrument Message Specification

Acoustic Instrument Message Specification Acoustic Instrument Message Specification v 0.4 Proposal June 15, 2014 Keith McMillen Instruments BEAM Foundation Created by: Keith McMillen - keith@beamfoundation.org With contributions from : Barry Threw

More information

MusicGrip: A Writing Instrument for Music Control

MusicGrip: A Writing Instrument for Music Control MusicGrip: A Writing Instrument for Music Control The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published Publisher

More information

Simple Harmonic Motion: What is a Sound Spectrum?

Simple Harmonic Motion: What is a Sound Spectrum? Simple Harmonic Motion: What is a Sound Spectrum? A sound spectrum displays the different frequencies present in a sound. Most sounds are made up of a complicated mixture of vibrations. (There is an introduction

More information

Take a Break, Bach! Let Machine Learning Harmonize That Chorale For You. Chris Lewis Stanford University

Take a Break, Bach! Let Machine Learning Harmonize That Chorale For You. Chris Lewis Stanford University Take a Break, Bach! Let Machine Learning Harmonize That Chorale For You Chris Lewis Stanford University cmslewis@stanford.edu Abstract In this project, I explore the effectiveness of the Naive Bayes Classifier

More information

Standard 1: Singing, alone and with others, a varied repertoire of music

Standard 1: Singing, alone and with others, a varied repertoire of music Standard 1: Singing, alone and with others, a varied repertoire of music Benchmark 1: sings independently, on pitch, and in rhythm, with appropriate timbre, diction, and posture, and maintains a steady

More information

inter.noise 2000 The 29th International Congress and Exhibition on Noise Control Engineering August 2000, Nice, FRANCE

inter.noise 2000 The 29th International Congress and Exhibition on Noise Control Engineering August 2000, Nice, FRANCE Copyright SFA - InterNoise 2000 1 inter.noise 2000 The 29th International Congress and Exhibition on Noise Control Engineering 27-30 August 2000, Nice, FRANCE I-INCE Classification: 7.9 THE FUTURE OF SOUND

More information

Spectral Sounds Summary

Spectral Sounds Summary Marco Nicoli colini coli Emmanuel Emma manuel Thibault ma bault ult Spectral Sounds 27 1 Summary Y they listen to music on dozens of devices, but also because a number of them play musical instruments

More information

Music Alignment and Applications. Introduction

Music Alignment and Applications. Introduction Music Alignment and Applications Roger B. Dannenberg Schools of Computer Science, Art, and Music Introduction Music information comes in many forms Digital Audio Multi-track Audio Music Notation MIDI Structured

More information

Power Standards and Benchmarks Orchestra 4-12

Power Standards and Benchmarks Orchestra 4-12 Power Benchmark 1: Singing, alone and with others, a varied repertoire of music. Begins ear training Continues ear training Continues ear training Rhythm syllables Outline triads Interval Interval names:

More information