Motion Analysis of Music Ensembles with the Kinect

Aristotelis Hadjakos, Zentrum für Musik- und Filminformatik, HfM Detmold / HS OWL, Hornsche Straße 44, 32756 Detmold, Germany, hadjakos@hfm-detmold.de
Tobias Großhauser, Electronics Laboratory, ETH Zürich, Gloriastrasse 35, 8092 Zürich, tobias@grosshauser.de
Werner Goebl, Institute of Music Acoustics (IWK), University of Music and Performing Arts Vienna / Austrian Research Institute for Artificial Intelligence, Vienna, Austria, goebl@mdw.ac.at

ABSTRACT

Music ensembles have to synchronize their performances with the highest precision in order to achieve the desired musical results. For that purpose the musicians do not only rely on their auditory perception but also perceive and interpret the movements and gestures of their ensemble colleagues. In this paper we present a method for motion analysis of musical ensembles based on head tracking with a Kinect camera. We discuss first experimental results with a violin duo performance and present ways of analyzing and visualizing the recorded head motion data.

Keywords

Kinect, Ensemble, Synchronization, Strings, Functional Data Analysis, Cross-Correlogram

1. INTRODUCTION

Members of music ensembles have to synchronize with one another with the highest precision in order to achieve the desired common musical goal. How musical ensembles achieve such delicate synchronization is a wide and rich topic for research. Many aspects play a role, such as the musical style, the configuration of the ensemble (piano, string instruments, etc., and perhaps also a conductor or dancers), the experience of the musicians, and many others. Synchronizing requires the musicians to not rely on their auditory perception alone but also to perceive and interpret the movements and gestures of their ensemble colleagues. In order to pursue further research in this direction, we developed a Kinect-based method for motion analysis of musical ensembles. Our method concentrates on head movements, which are clearly visible and which the musician may use to communicate with the other ensemble members and the audience.

Research in ensemble synchronization could provide new pedagogical insights for ensemble musicians. Furthermore, a better understanding of ensemble synchronization could lead to better computer accompaniment, since current solutions [6] are not based on an informed model of (human) ensemble synchronization. Head motion has already been shown to play an important communicative role in piano duets [2]. However, those studies have used obtrusive sensor technologies such as inertial sensing or marker-based motion capture. This paper contributes a method for motion analysis of musical ensembles based on head tracking from depth camera images. This provides an unobtrusive and affordable way to examine synchronization by movement analysis in musical ensembles. Furthermore, we present first experimental results with a violin duo.

2. RELATED WORK

The Kinect has been used in many musical projects such as those described in [9, 7, 11, 12].
Originally, the Kinect was intended for capturing human movement unobtrusively. The standard algorithm [8] that is shipped with the Kinect is based on a decision forest trained with an extensive training set. This training set is composed of recordings of actors who were filmed with a depth camera while their movements were simultaneously tracked with a marker-based optical motion capture system. Furthermore, artificial training data was constructed by simulating and rendering human movement. This is possible since depth information is much less variable than RGB information, which usually varies between users due to different clothing and between recordings due to different lighting conditions. The method shipped with the Kinect is not suited for capturing instrumentalists' movements, since such conditions (having a violin in the hand, sitting at the piano) were not reflected in the training set. It would be possible to adopt the approach and construct a training set in order to apply Shotton et al.'s method [8]. However, the large effort needed to construct such a dataset makes this approach impractical for musical applications. Therefore, other solutions have to be found for musical applications, such as for capturing pianist movements [3].

In this paper we provide a method for analysis of head movements in music ensembles. In contrast to [3], which provides unobtrusive pianist motion capture of a large range of joints of a single person, we detect the head movements of multiple ensemble members. Furthermore, our method determines not only the head position but also the viewing direction of the performers. We report first experimental results and data analysis with a violin duet performance.

3. IMAGE ANALYSIS

Setup & Recording: A Kinect depth camera is mounted facing downwards so that it records the music ensemble from above (see Fig. 1). The optimal height of the Kinect is determined empirically with the ensemble in place, so that the heads of the ensemble members are always visible during the performance of the piece, taking into account the head swaying motions that are typical during instrument performance. Our analysis algorithm assumes that the heads of the ensemble members are the highest areas in the depth image (i.e., closest to the camera). Therefore, the recording area has to be free of other high objects. The depth camera images are recorded in a lossless format at a frame rate of 30 frames per second for later analysis.
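As an illustration of this recording stage, the following sketch stores each 16-bit depth frame as a lossless PNG. The frame source get_depth_frame() is a hypothetical placeholder for whatever Kinect driver is in use (e.g., libfreenect or OpenNI bindings); only the idea of lossless storage at 30 fps is taken from the text, and the other details are assumptions.

```python
import time
import cv2          # OpenCV, used here for lossless 16-bit PNG writing
import numpy as np

def get_depth_frame():
    """Hypothetical placeholder: return one Kinect depth frame as a uint16
    NumPy array (zeros marking failed measurements). Replace with the
    driver call of the Kinect library actually used."""
    raise NotImplementedError

def record_depth(out_dir, duration_s=60.0, fps=30):
    """Record depth frames losslessly at roughly the Kinect's 30 fps."""
    frame_period = 1.0 / fps
    n_frames = int(duration_s * fps)
    for i in range(n_frames):
        t0 = time.time()
        depth = np.asarray(get_depth_frame(), dtype=np.uint16)
        # PNG is lossless, so the raw depth values can be recovered exactly.
        cv2.imwrite(f"{out_dir}/frame_{i:06d}.png", depth)
        time.sleep(max(0.0, frame_period - (time.time() - t0)))
```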

Figure 1: The raw image provided by the Kinect. Darker areas are closer to the camera; lighter areas are farther away. The heads and the bow tips are closest to the camera.

Figure 2: Neighborhood around the candidate head pixel. The rectangle is spanned by 10 pixels in each direction.

Algorithm overview: We track the head positions of the ensemble members in order to provide means to analyze gestural ensemble communication and to examine movement synchronization among the ensemble members. The head seems to be well suited for expressive performance analyses, as shown by previous work [1]. The swaying motion of the head, which is a compound movement of the entire body, is clearly visible and usually has no specific function in operating the instrument. It is therefore available for communication with the audience and the ensemble members. In order to make the most of the depth data, both the position of the head and its direction (an indicator of the viewing direction) are tracked for all ensemble members. The design of the analysis algorithm takes computational efficiency into account to enable future use in real-time interactive computer music projects. The image analysis consists of two steps, which are discussed in the next sections: head position detection and ellipse matching.

3.1 Head position detection

The Kinect measures depth by projecting an infrared dot pattern into the space. The dot pattern is recorded with an infrared camera. By identifying the dot patterns in the image and evaluating the distance between the dots, the distance from the camera can be determined [10]. The raw Kinect depth image can be seen in Fig. 1. The different shades of grey correspond to different distances from the camera. Darker colors (i.e., lower values) correspond to points that are close to the camera; lighter colors correspond to points that are farther away. Due to shadows and reflections, it is not always possible to determine the distance. Areas in which the distance measurement fails are marked with zero values, visualized as black areas in the raw data image.

The heads are the highest areas in the image. In order to find the first head, the highest point in the image is identified by iterating through the depth values. It sometimes happens that the bow tip is even higher than the head. In order to filter out such values, the neighborhood of the candidate head pixel is examined. The neighborhood is a rectangular area centered around the candidate head pixel (see Fig. 2). If the candidate head pixel is really the highest point on the head, then the surrounding pixels will also be head pixels and thus have very similar depth values. On the other hand, if the candidate head pixel is in fact a bow tip pixel, then only some of the surrounding pixels will be bow pixels, and many other pixels will be floor pixels with distinctly different depth values. By examining the fraction of pixels in the neighborhood that have depth values similar to the candidate head pixel, these two conditions can be differentiated effectively.

The head position of the tallest ensemble member is determined through the method described above. In order to detect the head position of the next ensemble member, the same method is repeated; however, a large rectangular area centered around the previously detected head(s) is excluded from the analysis (see Fig. 3). The overall process is continued until all ensemble members are detected.

Figure 3: The shaded area centered around the head position of the taller player is excluded in order to detect the head position of the second player.
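The head position detection just described can be sketched as follows. This is a minimal illustration, assuming the depth frame is available as a uint16 NumPy array with zeros marking failed measurements and smaller values being closer to the camera; the neighborhood size, depth tolerance, exclusion radius, and similarity fraction are illustrative parameters, not values taken from the paper.

```python
import numpy as np

def find_heads(depth, n_players, neigh=10, depth_tol=40, excl=80, min_fraction=0.6):
    """Detect approximate head positions as the highest (closest) points in a
    top-down Kinect depth image. Returns a list of (x, y) pixel coordinates."""
    d = depth.astype(np.float32)
    d[d == 0] = np.inf              # ignore pixels where the depth measurement failed
    heads = []
    for _ in range(n_players):
        while True:
            # Candidate head pixel: the point closest to the camera.
            y, x = np.unravel_index(np.argmin(d), d.shape)
            if not np.isfinite(d[y, x]):
                return heads        # no valid pixels left
            # Examine a rectangular neighborhood around the candidate.
            y0, y1 = max(0, y - neigh), min(d.shape[0], y + neigh + 1)
            x0, x1 = max(0, x - neigh), min(d.shape[1], x + neigh + 1)
            window = d[y0:y1, x0:x1]
            similar = np.isfinite(window) & (np.abs(window - d[y, x]) < depth_tol)
            if similar.mean() >= min_fraction:
                heads.append((x, y))    # enough similar neighbors: a head, not a bow tip
                break
            d[y, x] = np.inf            # likely a bow tip: discard and search again
        # Exclude a large rectangle around the detected head before searching
        # for the head of the next (shorter) player.
        y0, y1 = max(0, y - excl), min(d.shape[0], y + excl + 1)
        x0, x1 = max(0, x - excl), min(d.shape[1], x + excl + 1)
        d[y0:y1, x0:x1] = np.inf
    return heads
```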
3.2 Ellipse matching

In the previous step, the approximate positions of the heads of the ensemble players were detected. In this step the position of the head is refined and the head direction is determined. First, all head pixels of each player are determined. This is done by comparing the depth values of the surrounding pixels in a rectangular area with the depth value of the highest head pixel determined in the previous step. If the depth difference of a pixel amounts to only a few centimeters, it is recognized as a head pixel (see Fig. 4). Sometimes bow pixels are located within that rectangular area and are labeled incorrectly. To avoid this problem, a contour detection algorithm is used. This algorithm finds the contours of the regions of connected pixels. The largest contour is then recognized as the head contour. This provides an effective way of differentiation, since the contours originating from the bow are rather small. An ellipse is matched onto the contour of the head (see Fig. 5). The center point and the direction of the matched ellipse correspond to the center of the head and the head direction.

Figure 4: Head pixels are detected in a rectangular area around the highest head point based on the depth difference. Sometimes bow pixels are incorrectly labeled as head pixels.

Figure 5: The contours of the head (upper) and the matched ellipses (lower).
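A sketch of this refinement step using OpenCV's contour detection and ellipse fitting (4.x API) is given below; the window size and depth tolerance are illustrative assumptions, and the raw depth units depend on the driver in use.

```python
import cv2
import numpy as np

def fit_head_ellipse(depth, head_xy, win=80, depth_tol=150):
    """Refine a detected head position and estimate the head direction by
    fitting an ellipse to the largest connected region of head pixels.
    Returns ((cx, cy), (major, minor), angle_degrees) in image coordinates."""
    x, y = head_xy
    y0, y1 = max(0, y - win), min(depth.shape[0], y + win + 1)
    x0, x1 = max(0, x - win), min(depth.shape[1], x + win + 1)
    window = depth[y0:y1, x0:x1].astype(np.float32)
    head_depth = float(depth[y, x])
    # Pixels within a small depth difference of the highest head point
    # count as head pixels (zeros are invalid measurements).
    mask = ((window > 0) & (np.abs(window - head_depth) < depth_tol)).astype(np.uint8)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    # The largest contour is the head; bow contours in the window are much smaller.
    head_contour = max(contours, key=cv2.contourArea)
    if len(head_contour) < 5:           # fitEllipse needs at least 5 points
        return None
    (cx, cy), axes, angle = cv2.fitEllipse(head_contour)
    return (cx + x0, cy + y0), axes, angle   # shift back to full-image coordinates
```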

4. EXPERIMENTAL RESULTS

We recorded two violinists performing a short piece with a Kinect camera mounted above the musicians. The head positions and orientations were extracted with the algorithm described above. The resulting head position and head orientation trajectories are plotted in Fig. 6. The forward-backward (y) and sideways (x) motion of both musicians do not adhere to a strict period, as one would expect if there were a one-to-one correspondence to the pulse of the music. Although the trajectories of player B (blue) are not strictly periodic, they show a high regularity, grouping time into small chunks according to the fine-grained musical structure of the piece. Player A's movements (red), on the other hand, are freer and less regular. Judging from the motion graphs alone, it seems that player B (blue) has the lead in controlling the ensemble's tempo, as evidenced by the busier graphs and more regular motions. We did not detect any systematic variation in the diagram showing the viewing direction.

Acceleration, the second derivative of position, has been shown to contain visual information on timing cues used in ensemble performance, particularly in conducting gestures [4]. Therefore, in a further analysis step the x and y position data were converted to a functional form using Functional Data Analysis [5]. Order-6 B-splines were fit to the data, with knots placed every 5 data points, and smoothed using a roughness penalty on the fourth derivative (λ = 10^5), which smooths the second derivative (acceleration). Head acceleration in x and y was combined by taking the square root of the summed squares of the x and y acceleration trajectories. The compound head acceleration (indicating head acceleration in any direction) is plotted in Fig. 7 (top panel) for both players.

To elucidate fine-grained temporal relationships in the two musicians' head movements, we computed multiple cross-correlations between the two compound head acceleration trajectories. The bottom panel of Fig. 7 shows the color-coded coefficients of cross-correlations calculated on windows of 3.33 seconds (200 samples at a re-sampling rate of 60 fps) shifted sideways by 12.5%, resulting in about 2.5 analyses per second. Red colors reflect regions of high correlation (in-phase movements between the musicians), while blue colors show negative correlations (anti-phase motion). Negative lags (in seconds) mean that A's head movements lead the other's movements, while positive lags point to the opposite (B's movements anticipating A's movements). The cross-correlogram reveals longer regions of dark red: from about 13-24 s, player A seems to anticipate the other player by about half a second, while the opposite occurs between 36 s and 47 s. This novel way of presenting motion synchronicities over time may represent a powerful analysis tool to reveal otherwise hidden motion relationships between performing musicians.

5. CONCLUSION

The members of a musical ensemble have to synchronize with one another with the highest precision to achieve the desired musical goal. The musicians do not only rely on acoustic information but also anticipate timing and communicate with each other based on gestures and movements. There has been considerable research on ensemble synchronization (see [2] for a discussion of existing work). However, up to now, motion analyses with ensembles have been performed using intrusive technologies, such as inertial sensing or marker-based optical motion capture systems. Particularly the latter are very expensive in terms of both acquisition cost and data evaluation. In this paper we proposed a head tracking method using a Kinect depth camera which is both very inexpensive and, even more importantly, unobtrusive in the sense that it does not require markers to be glued onto the participants. Furthermore, we have demonstrated the opportunities of our motion tracking method for head motion analysis, revealing interaction patterns hidden in the complex kinematics of the musicians' body motion. Future work will evaluate this tracking and analysis method in controlled real-life experiments. Another path of extension is to enable the algorithm to capture and analyze data from multiple daisy-chained and synchronized Kinect cameras. This would enable us to monitor larger ensembles, up to an orchestra, and explore the widely unknown kinematic dynamics of music expression and communication evolving during performances of large music ensembles.
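The analysis of Section 4 can be approximated with the sketch below. It is a simplified stand-in, not the authors' exact pipeline: a Savitzky-Golay derivative estimate replaces the penalized B-spline smoothing of the Functional Data Analysis step, while the window length (3.33 s) and hop (12.5%) follow the values given above; the filter parameters and maximum lag are assumptions.

```python
import numpy as np
from scipy.signal import savgol_filter

def compound_acceleration(x, y, fs=60, win_len=21, polyorder=5):
    """Estimate head acceleration in any direction from x/y position traces
    (in pixels, sampled at fs Hz). A Savitzky-Golay second derivative stands
    in here for the smoothed B-spline acceleration described above."""
    ax = savgol_filter(x, win_len, polyorder, deriv=2, delta=1.0 / fs)
    ay = savgol_filter(y, win_len, polyorder, deriv=2, delta=1.0 / fs)
    return np.sqrt(ax ** 2 + ay ** 2)

def cross_correlogram(a, b, fs=60, win_s=3.33, hop_frac=0.125, max_lag_s=3.0):
    """Windowed cross-correlations of two compound-acceleration trajectories.
    coeffs[i, j] is the Pearson correlation of the window of a starting at
    times[i] with b shifted by lags[j] seconds."""
    win = int(round(win_s * fs))
    hop = max(1, int(round(win * hop_frac)))
    max_lag = int(round(max_lag_s * fs))
    lags = np.arange(-max_lag, max_lag + 1)
    starts = list(range(max_lag, len(a) - win - max_lag, hop))
    coeffs = np.empty((len(starts), len(lags)))
    for i, s in enumerate(starts):
        wa = a[s:s + win]
        for j, lag in enumerate(lags):
            wb = b[s + lag:s + lag + win]
            coeffs[i, j] = np.corrcoef(wa, wb)[0, 1]
    times = np.array(starts) / fs
    return times, lags / fs, coeffs
```

Plotting coeffs with time on the horizontal axis and lag on the vertical axis, color-coded by correlation, yields a correlogram of the kind shown in the bottom panel of Fig. 7.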

Figure 6: The head position trajectories of player A (red) and player B (blue). The first two diagrams show the forward-backward motion of the musicians (along the image y-axis). The next two diagrams show the sideways motion of the musicians (along the image x-axis). The last diagram shows the musicians' head orientation (an indicator of viewing direction). The horizontal gray lines crossing all diagrams are placed at maxima and minima of player B's forward-backward motion (the second diagram).

Figure 7: Violin duet performance: compound head acceleration (in pixels/s²) against time in seconds for player A (left) and player B (right) (upper panel), and cross-correlation coefficients (color-coded from in-phase to anti-phase) for lag in seconds ("A ahead" to "A behind") over time in seconds (lower panel). Regions of dark red indicate kinematic in-phase relationships at various lag times.

6. ACKNOWLEDGMENTS

We thank Anne Weber for performing in the violin duo recording sessions. This work is in part supported by the Austrian Science Fund (FWF project P 24546).

7. REFERENCES

[1] G. Castellano, M. Mortillaro, A. Camurri, G. Volpe, and K. Scherer. Automated analysis of body movement in emotionally expressive piano performances. Music Perception, pages 103-119, 2008.
[2] W. Goebl and C. Palmer. Synchronization of timing and motion among performing musicians. Music Perception, 2009.
[3] A. Hadjakos. Pianist motion capture with the Kinect depth camera. In SMC 2012, 2012.
[4] G. Luck and J. A. Sloboda. Spatio-temporal cues for visually mediated synchronization. Music Perception, 26(5):465-473, 2009.
[5] J. O. Ramsay and B. W. Silverman. Functional Data Analysis. Springer, New York, 2nd edition, 2005.
[6] C. Raphael. Current directions with Music Plus One. In SMC-09, 2009.
[7] S. Şentürk, S. W. Lee, A. Sastry, A. Daruwalla, and G. Weinberg. Crossole: A gestural interface for composition, improvisation and performance using Kinect. In NIME-12, 2012.
[8] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In CVPR, volume 2, 2011.
[9] S. Trail, M. Dean, G. Odowichuk, T. F. Tavares, P. Driessen, W. A. Schloss, and G. Tzanetakis. Non-invasive sensing and gesture control for pitched percussion hyper-instruments using the Kinect. In NIME-12, 2012.
[10] Wikipedia. Kinect — Wikipedia, the free encyclopedia, 2013. [Online; accessed 25-April-2013].
[11] Q. Yang and G. Essl. Augmented piano performance using a depth camera. In NIME-12, 2012.
[12] M.-J. Yoo, J.-W. Beak, and I.-K. Lee. Creating musical expression using Kinect. In NIME-11, 2011.