Visual enhancement using multiple audio streams in live music performance
Rozenn Dahyot (1), Conor Kelly (1), and Gavin Kearney (2)

(1) School of Computer Science and Statistics, Trinity College Dublin, Ireland
(2) Department of Electronic and Electrical Engineering, Trinity College Dublin, Ireland

Correspondence should be addressed to Rozenn Dahyot (Rozenn.Dahyot@cs.tcd.ie)

ABSTRACT

The use of multiple audio streams from digital mixing consoles is presented for application to the real-time enhancement of synchronised visual effects in live music performances. The audio streams are processed simultaneously, and their temporal and spectral characteristics can be used to control the intensity, duration and colour of the lights. The efficiency of the approach is tested on rock and jazz pieces. The result of the analysis is illustrated by a 3-D OpenGL animation showing the synchronous audio-visual events occurring in the musical piece.

1. INTRODUCTION

Visual effects such as stage lighting or fog machines are widely used in live music performances to enhance the emotion and mood of the music played. Such schemes are designed to visually immerse the audience in the feeling of the song. Video displays such as TV screens or video projectors are now standard facilities in small to large venues, and recent trends in art involve designing computer programs that allow automatic interaction between the music and the visual effects [1]. Typically, in small to medium sized auditoriums, sound reinforcement for jazz and rock ensembles performing on stage involves the use of around 8 microphones, a mixing console and loudspeaker amplification. The microphone signals are pre-amplified and processed at the console, and a stereo mix for Front of House amplification is generated. This stereo mix is fed to the lighting desk, which allows control over several effects of the stage lights (colour, flash, intensity, direction, etc.).
Often artificial intelligence is involved in the making of shows to assist the work of sound and lighting engineers. For example, one popular automatic process is real-time beat detection, implemented at a basic level on lighting desks [2]. The visual effects can then be synchronised to the music. Such algorithms tend to focus primarily on the low-frequency content of the stereo mix to infer tempo, since the mid and upper frequency ranges are generally cluttered by the mix of sources. However, current lighting systems do not avail of the multiple audio streams available from digital mixing consoles through protocols such as the popular Tascam Digital Interface (TDIF) or Alesis Digital Audio (ADAT). We propose here to process in real time a multi-channel audio stream from a digital mixing console to perform reliable lighting enhancement through temporal beat detection and frequency analysis. The advantage of such a setting is that the musical content of each instrument is well separated since, in well-engineered performances, the sound pressure level of a particular instrument contributes more at its corresponding microphone than the spill from the other instruments. Thus, no source-separation processing is required for the different instruments. The temporal and spectral characteristics of these signals can then be analysed simultaneously to generate enhanced visual effects. Another advantage of using the separated sources is that the mid to high frequency components, which are crucial in determining signal attack, are uncluttered. Thus a high audio resolution is of importance for the accurate detection of pitch and of the temporal properties of higher-frequency percussive instruments, such as hi-hats, as well as for the visual enhancement of spatial effects such as reverbs or delays on vocals or guitars, for instance.
We propose here to create a portable, affordable system that automatically generates in real time a visual artistic rendering of the music being played live in a small or medium venue, without the undesirable budget constraints that face many working artists. As an alternative to lighting, we illustrate our multi-stream music analysis by creating a real-time OpenGL animation that reacts to events in the music piece. Such a system could be used to increase the exposure of not-yet-well-known artists to an international audience in the virtual world (e.g. by simultaneously performing in Second Life). Our smart system has been tested on jazz and rock pieces. We show that real-time, high-resolution multi-stream music analysis can be performed with reliable accuracy. We use various methods such as frequency spectrum analysis, beat detection and amplitude analysis to get a feel for the mood and tempo of the song (see section 4). An animation is then created, representing with more or less accuracy the members of the band with their instruments on a stage. The lighting and the motion of the characters in the rendering change in real time according to the song (see section 5). Section 6 comments on the current performance of our system.

2. RELATED WORKS

The work presented in this paper mixes different domains of computer science: digital music processing and computer graphics. In the following paragraphs, references are given to both areas of research.

Analysis of music. Digital processing of music has attracted a lot of attention, mainly due to the high commercial value of online song sales. A huge literature exists on features that are efficient for processing music, and a good review can be found in [3]. Some of them, such as loudness, the Fourier transform, band energy or the median frequency, have been used in our system and are presented in section 4. Beat detection significantly aids the classification of music genre, and is an elementary step for more thorough analysis of the music [4].
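As a flavour of two of these features, band energy and median frequency can be sketched from a windowed FFT as below. This is a hedged illustration using NumPy: the window size matches the one used later in the paper, but the band edges, the test tone and all function names are illustrative assumptions, not the authors' code.

```python
import numpy as np

FS, WINDOW = 44100, 1024   # sampling rate (Hz) and analysis window (samples)

def band_energy(frame: np.ndarray, f0: float, f1: float) -> float:
    """Spectral energy of one analysis window restricted to [f0, f1] Hz."""
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / FS)
    band = (freqs >= f0) & (freqs <= f1)
    return float((spectrum[band] ** 2).sum())

def median_frequency(frame: np.ndarray) -> float:
    """Spectral centroid: mean frequency weighted by spectral magnitude."""
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / FS)
    return float((freqs * spectrum).sum() / spectrum.sum())

# A pure 440 Hz tone carries its energy in a band around 440 Hz
# and almost none in a high band.
t = np.arange(WINDOW) / FS
tone = np.sin(2 * np.pi * 440.0 * t)
print(band_energy(tone, 300, 600) > band_energy(tone, 4000, 8000))  # -> True
```

With a 1024-sample window at 44.1 kHz the FFT bins are about 43 Hz apart, which bounds the frequency resolution of these features.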
Further methods to infer the structure of popular music have been proposed by Maddage [4], including music transitions, voice detection and repeated-pattern detection. Applications of these methods can be found in music transcription, music summarisation and retrieval, and also in music streaming.

Visual music. Perception of music. By stimulating a second sense along with music, visual effects have the ability to contribute to the communication that takes place between performers and their listeners [5]. Examples of visual contributions in music performances include facial expressions or body gestures and movements of the performers, video projections, and light and pyrotechnic shows, amongst others. They amplify the emotion of the music and completely immerse the observer in the feeling of the song.

Visuals & Graphics. In the following paragraphs, we report several computer-aided systems that have been proposed to generate visuals for music. Visuals can be created in many ways, e.g. by films and light shows and, as an illustration, the reader can visit the web exhibition Visual Music [6], which presents several visual expressions explored by artists to extend the perception of music. One important visual cue to music is the natural movement of the body expressed by performers [5] or by listeners. For instance, human foot taps are inferred from the perceived beat. Dancing is also a natural illustration of music. Denman et al. [7] have proposed to synchronise the beat of a song with the visual motion of a dancer in a video (performed for another song). Time-scale changes of the video feed are then performed to synchronise the detected beat of the new song with the movements of the dancer. The beat detection from a monophonic audio stream and the extraction of the motion from the video are performed offline. The created video can have applications for generating visuals in nightclubs or for post-production in music videos.
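The time-scaling idea in [7] amounts to stretching the video so that the dancer's beat interval matches the new song's. A toy sketch of that relationship, assuming constant tempi; all names and the BPM values are illustrative, not from [7]:

```python
def playback_rate(video_bpm: float, song_bpm: float) -> float:
    """Rate multiplier so the dancer's beat matches the song's tempo."""
    return song_bpm / video_bpm

def remap_time(t: float, rate: float) -> float:
    """Timestamp in the original video to sample when rendering at time t."""
    return t * rate

# A dance video shot at 100 bpm accompanying a 120 bpm song plays 1.2x faster:
rate = playback_rate(100.0, 120.0)
print(rate)                   # -> 1.2
print(remap_time(2.0, rate))  # -> 2.4
```

In practice [7] works with detected beats rather than constant tempi, so the rate varies over time, but the per-segment mapping has this form.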
Several applications using music to synchronise computer graphics animations in real time have been proposed [8, 9, 10]. Applications can be found in entertainment or in learning music. In [9], the motion curves of a computer animation are synchronised to the music. In [10], the graphics animation mimics the expressiveness of a drummer. However, in contrast to [7], no audio analysis is performed, as the relevant cues are already available in the complementary MIDI (Musical Instrument Digital Interface) stream of the soundtrack [8, 9, 10]. MIDI stores the events that would create the sound instead of the sound itself. This allows easy access to pitch, velocity, instrument and timing information. Using a MIDI file instead of a raw audio signal avoids the need to perform digital music processing in real time. Unfortunately, this information is not always accessible in real time from every instrument playing in a musical piece.

3. OVERVIEW OF OUR SYSTEM

We consider live performances of a small Rock or Jazz band with a few musicians and instruments. On stage, several microphones are placed close to each instrument. We assume the availability of multichannel audio streams from TDIF or ADAT interfaces, from consoles such as the Yamaha 02R96 or from the standalone analogue-to-digital conversion capabilities of units such as the MOTU 2408 MkII. Figure 1 shows an overview of the system. Several audio recordings coming from different microphones (mic1, mic2, etc.) are available for analysis. The final mix of the song is used only as a soundtrack in the final rendering.

Fig. 1: Overview of our system.

The main advantage of considering separate audio sources from each microphone instead of the mixed track is that the different sources are well separated. In fact, the closest instrument to each microphone is the one mainly audible on the corresponding audio stream, with the spill from the other instruments in the order of 3 to 5 dB lower. Using only the mix to analyse the music would lessen the amount of data to process in real time, but would require computationally expensive routines to separate the contribution of each musician [11]. As an alternative, we propose to take advantage of the direct-out or bussing facilities available on most mixing consoles, so that the separated audio is already presented for analysis. This choice allows us to use simple and fast algorithms to extract relevant music features reliably, but has the drawback of requiring the analysis of several audio streams in parallel.

4. MUSIC ANALYSIS

Currently four audio channels are analysed simultaneously. These correspond most of the time to the microphones of the singer (voice), the drums, the guitarist and the bass. For the Jazz piece we analysed, the saxophone is selected instead of the guitar. In our simulation, independent audio streams are stored as mono WAV files sampled at 44.1 kHz. Figure 2 shows an example of these different recordings for an extract of a rock song.

Fig. 2: 10 seconds of a rock song. From top to bottom: pressure signals of singer voice, guitar, bass and drums.

4.1. Beat detection

The beat detection algorithm is performed on the drum audio stream. For the audio signal x(t), the loudness is computed for each window of 1024 samples (i.e. Δ = 0.0232 s), starting at t = nΔ, by:

    l(nΔ) = ∫_{nΔ}^{(n+1)Δ} x²(t) dt    (1)
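The windowed loudness of equation (1), together with the adaptive thresholding used for beat detection (described next in the text), can be sketched as follows. This is a discrete-time illustration under stated assumptions: γ = 1.4 and the impulsive-beat rule follow the paper, while the function names, the noise floor and the synthetic drum signal are ours.

```python
import numpy as np

FS, WINDOW = 44100, 1024   # 1024 samples is ~23.2 ms at 44.1 kHz
GAMMA = 1.4                # proportional coefficient, set by hand in the paper
HISTORY = FS // WINDOW     # ~1 second worth of preceding windows

def loudness(x: np.ndarray) -> np.ndarray:
    """l(n): energy of each successive 1024-sample window (equation (1))."""
    n = len(x) // WINDOW
    return (x[:n * WINDOW].reshape(n, WINDOW) ** 2).sum(axis=1)

def detect_beats(l: np.ndarray) -> list:
    """Windows where l exceeds GAMMA times the average loudness of the
    preceding second, keeping only the first window of each run above
    threshold (a beat is impulsive)."""
    beats, above = [], False
    for n in range(1, len(l)):
        threshold = GAMMA * l[max(0, n - HISTORY):n].mean()
        if l[n] > threshold:
            if not above:
                beats.append(n)
            above = True
        else:
            above = False
    return beats

# Synthetic drum track: a low noise floor plus two short bursts.
rng = np.random.default_rng(0)
x = 0.01 * rng.standard_normal(100 * WINDOW)
for w in (30, 70):                  # bursts in windows 30 and 70
    x[w * WINDOW:(w + 1) * WINDOW] += 0.5
print(detect_beats(loudness(x)))    # -> [30, 70]
```

The running mean over the preceding second is what makes the threshold adaptive: a loud passage raises it, so only onsets that stand out locally fire.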
The detection of the beat is performed by thresholding the loudness information. To be independent of background noise during the performance and of the different loudness of the different sources, the threshold T is adaptive for each frame:

    T(nΔ) = γ · (1/1s) ∫_{nΔ−1s}^{nΔ} x²(t) dt    (2)

where γ is a proportional coefficient set by hand at γ = 1.4, and the normalised integral corresponds to the average loudness over the second preceding the window n. A beat is then detected when l(nΔ) > T(nΔ). Sometimes several successive temporal windows are detected above the threshold. Consequently, only the first detected beat amongst a successive sequence of detected beats is actually labelled as a beat (i.e. the rule is that a beat is impulsive and cannot be detected in successive windows). Figure 3 shows the results of our beat detection performed on ten seconds of a rock song performed live. As can be noticed, the detected beat is sometimes one temporal window in advance of the actual peak in the loudness signal. This means that when a beat is detected, it is with Δ = 0.0232 s of accuracy. This temporal precision in the audio analysis is largely sufficient, as the visual rendering only changes every 0.04 s (i.e. the animation has 25 frames per second).

Fig. 3: Result of beat detection performed on 10 s of the audio track of drums for a rock song recorded in a live session (cf. fig. 2). Red dots indicate detected beats and the blue curve corresponds to the loudness computed every Δ = 0.0232 s.

4.2. Fourier Analysis

For each audio stream, a Fast Fourier Transform (FFT) is computed every 23.2 ms (or 1024 samples) as follows:

    X(nΔ, f) = ∫_{nΔ}^{(n+1)Δ} x(t) exp(−2iπft) dt    (3)

Using an adapted bandpass filter, each instrument is separated from any possible spill coming from other sources, and information such as the band energy of the instrument (or the voice) is recorded:

    A(nΔ) = ∫_{f0}^{f1} |X(nΔ, f)| df    (4)

where [f0, f1] defines the frequency band of the instrument. Without much additional computational cost, the median frequency, or mean of the spectrum, is also computed as follows [3]:

    f̄(nΔ) = ∫_{f0}^{f1} f |X(nΔ, f)| df / ∫_{f0}^{f1} |X(nΔ, f)| df    (5)

These are the measures used in our system. Other informative features such as pitch can also be computed if the computation time remains low for the hardware used.

5. REAL-TIME ANIMATION

A simulation of a stage complete with lighting effects is rendered on screen. This rendering is created and drawn using the OpenGL graphics library. The graphics methods used in the render include 3D modelling, texture mapping and tessellated objects (see figure 4), as explained in the following paragraphs.

Fig. 4: A screen shot of the render: the vocalist and guitarist are illuminated in the foreground; ambient lighting shines red in the background, indicating an uptempo beat.

5.1. 3D Modelling

The musicians and many of the stage props such as the drums, guitars, microphones, lights and light rig were modelled in 3D Studio Max as 3D models and then imported into OpenGL. Figure 5 shows a screenshot of the drums being modelled in 3D Studio Max. 3DS models are composed of many thousands of vertices and their texture coordinates, and are quite computationally expensive to draw. For this reason, there is a trade-off between the detail represented in the simulation and the speed at which it runs.

Fig. 5: Modelling the drum kit in 3D Studio Max.

5.2. Texture mapping

Texture mapping is the process of applying textures (stored as JPEGs) to drawn shapes in order to add colour and realism to the scene. The less complicated objects in the render, such as the enclosing walls of the stage and the front of the stage floor, can be represented with far fewer vertices and so are drawn with their static coordinates specified in the code. The texture mapping coordinates are also specified, so at render time these textures are applied to the shapes to give them a realistic look. This is far more efficient than drawing 3DS models and so is used wherever possible.

5.3. Tessellated objects

OpenGL uses the Phong illumination model to calculate lighting in scenes: light reflections are calculated at each vertex of an object and the light is interpolated to the surrounding polygons. As spot lights shine down onto the stage floor, it is required to display and reflect them realistically. To do this, the stage floor is drawn as a very fine mesh of vertices, in a process known as tessellation. However, as this is a computationally expensive process, a trade-off between realism and computation has been found.

5.4. Animation to render the music feel

The visualisation of the information extracted from the music played is done in three ways:

1. Spot lighting. The most important of these is the concentration of spotlighting on any musician who is currently active. This uses information from the FFT performed on each musician's channel. The energies A are calculated and, when a certain threshold level is breached, a spot light is shone on the musician.

2. Ambient lighting. Rather than concentrating on individual channels, ambient lighting focuses on the behaviour of the song as a whole and so considers the energies of all channels. It analyses the predominant frequencies of the FFT (i.e. f̄) and uses tempo information to attempt to provide ambient lighting in accordance with the mood of the song. The interpretation of the songs is based on the generalisation that lower frequencies and lower tempos indicate a more relaxed mood; this triggers low-key colours such as purple or dark red. Brighter colours illuminate the stage when songs occupy higher frequency bands and have faster tempos, to reflect the more excited performance.

3. Physical movement of the musicians on stage. Spot lighting demonstrates that a musician's level has breached a certain threshold and that the musician is deemed to be playing or singing. Simple animation of the characters is performed to give more information on exactly how loud their part is. This is shown by the speed at which the musicians' arms and bodies move: the faster they move, the bigger the part they are playing in the overall mix. This is done using hierarchical animation of the 3D models. The models seen in the render are actually made up of several models (body, head, legs, etc.) which are drawn together in OpenGL to simulate a person. These can be rotated and moved around each other to animate movement. This animation is controlled in accordance with the music analysis, so their movement is directly linked to the music they are producing.

Figure 6 shows two images from a recorded animation. A rock song is played where at first only one guitar and the drums are playing; then a second guitar starts soloing. The yellow lighting on the musicians indicates whether they are currently playing. The reddish lighting on the soloist illustrates the measure f̄ computed in real time, by changing linearly from blue to red. For a better visualisation, a bar with a moving and changing colour spot indicates the value of f̄ in its range in real time.

6. PERFORMANCE AND OPTIMIZATION

6.1. Hardware

The system as described in this paper currently runs on a standard laptop (model: HP NX9420) with an Intel Core Duo CPU at 1.66 GHz, 512 MB of RAM and an ATi Radeon X1600 graphics card with 256 MB of memory. The music analysis and the animation are created in real time, at 25 frames per second for the video rendering.

6.2. Computational efficiency

Computational efficiency of the code is a major issue for our system to work. As a lot of calculation takes place for the analysis of the multi-stream music, we also need a reasonable reaction time in the rendering to avoid desynchronising artefacts. Various methods are used in the graphics component of the project to ensure optimal performance. One such method is hardware pre-caching with OpenGL display lists, which allows some commands to be precompiled into the graphics card's memory and so removes the need for the CPU to perform repeated expensive calculations. This takes advantage of the dedicated memory and computational power of modern GPUs.
In a direct comparison between the code with no hardware pre-caching and the code which makes use of display lists, a speed-up (measured in the frames-per-second count) of roughly 40% was achieved.

6.3. Perception of the animation

Some results of the system are shown as videos at DemosMusic.html. The system has been successfully tested using four simultaneous audio channels from rock and jazz bands, mainly in live performance situations but also in less noisy environments such as studio recordings. The perceived animation is well synchronised to the beat, in particular the lights and the movement of the drummer.

Fig. 6: (a) The singer and one guitarist are not playing and are in the dark. (b) The guitarist is playing a solo and, using the median frequency f̄, the colour of the lights varies from blue (low values of f̄) to red (high values of f̄).

AES 31ST INTERNATIONAL CONFERENCE, London, England, 2007 JUNE
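The blue-to-red lighting driven by the median frequency f̄ (Figure 6) amounts to a linear interpolation between two RGB endpoints. A minimal sketch, assuming an illustrative frequency range and endpoint colours (the paper does not give these values):

```python
def freq_to_rgb(f_median: float, f_lo: float = 80.0, f_hi: float = 5000.0):
    """Map a median frequency to an RGB triple, blue (low) -> red (high).

    f_lo/f_hi bound the expected range of f_median; values outside are clamped.
    """
    t = (f_median - f_lo) / (f_hi - f_lo)
    t = min(1.0, max(0.0, t))          # clamp to [0, 1]
    blue, red = (0.0, 0.0, 1.0), (1.0, 0.0, 0.0)
    return tuple((1 - t) * b + t * r for b, r in zip(blue, red))

print(freq_to_rgb(80.0))    # -> (0.0, 0.0, 1.0): pure blue at the low end
print(freq_to_rgb(5000.0))  # -> (1.0, 0.0, 0.0): pure red at the high end
```

The resulting triple can be passed directly to an OpenGL light colour each frame, giving the continuous colour sweep the paper describes.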
7. CONCLUSION AND FUTURE WORK

We have presented an innovative system using multichannel music recordings for real-time rendering. Using the computational power of a recent laptop, we have shown how to simultaneously perform music analysis and render a graphic animation expressing some aspects of the music being played. Both CPU and GPU abilities have been used to speed up the system. Future directions of this research will look at creating other animations that better illustrate the music, such as using changes of facial expression on a virtual face [5] or animating a virtual dancer [8], or, more generally, at creating more expressive animations. The music processing part of the system can also be improved by using prior information, for instance for the beat detection, where currently no past information is used (i.e. beats are detected without knowledge of when the last beat was detected). The use of other informative audio features such as pitch will also be investigated.

ACKNOWLEDGEMENTS

Part of this work has been funded by the European Network of Excellence on Multimedia Understanding through Semantics, Computation and Learning (MUSCLE).

REFERENCES

[1] T. Winkler, Composing Interactive Music - Techniques and Ideas Using Max. MIT Press.

[2] U. Sandström, Stage Lighting Controls. Focal Press.

[3] M. Davy and S. Godsill, "Audio information retrieval: a bibliographical study," University of Cambridge, UK, Tech. Rep., November 2001.

[4] N. Maddage, "Automatic structure detection for popular music," IEEE Multimedia, vol. 13, no. 1, 2006.

[5] W. F. Thompson, P. Graham, and F. A. Russo, "Seeing music performance: Visual influences on perception and experience," Semiotica, 2005.

[6] "Visual music," web exhibition, Hirshhorn Museum, 2005.

[7] H. Denman and A. Kokaram, "Dancing to a different tune," in 2nd IEE European Conference on Visual Media Production (CVMP), 30 Nov. - 1 Dec. 2005.

[8] D. Reidsma, A. Nijholt, R. Poppe, R. Rienks, and G. Hondorp, "Virtual rap dancer: Invitation to dance," in CHI '06 Extended Abstracts on Human Factors in Computing Systems. ACM, 2006.

[9] M. Cardle, L. Barthe, S. Brooks, and P. Robinson, "Music-driven motion editing: Local motion transformations guided by music analysis," in 20th IEEE Eurographics UK Conference (EGUK), 2002.

[10] A. M. Wood-Gaines, "Modelling expressive movement of musicians," Master's thesis, MSc Computing Science, Simon Fraser University.

[11] S. Choi, A. Cichocki, H. Park, and S.-Y. Lee, "Blind source separation and independent component analysis: A review," Neural Information Processing - Letters and Reviews, vol. 6, no. 1, pp. 1-57, 2005.
More informationFigure 1: Feature Vector Sequence Generator block diagram.
1 Introduction Figure 1: Feature Vector Sequence Generator block diagram. We propose designing a simple isolated word speech recognition system in Verilog. Our design is naturally divided into two modules.
More informationDTS Neural Mono2Stereo
WAVES DTS Neural Mono2Stereo USER GUIDE Table of Contents Chapter 1 Introduction... 3 1.1 Welcome... 3 1.2 Product Overview... 3 1.3 Sample Rate Support... 4 Chapter 2 Interface and Controls... 5 2.1 Interface...
More informationInvestigation of Digital Signal Processing of High-speed DACs Signals for Settling Time Testing
Universal Journal of Electrical and Electronic Engineering 4(2): 67-72, 2016 DOI: 10.13189/ujeee.2016.040204 http://www.hrpub.org Investigation of Digital Signal Processing of High-speed DACs Signals for
More informationLX20 OPERATORS MANUAL
LX20 OPERATORS MANUAL CONTENTS SAFETY CONSIDERATIONS page 1 INSTALLATION page 2 INTRODUCTION page 2 FIRST TIME USER page 3 SYSTEM OPERATING LEVELS page 3 FRONT & REAR PANEL LAYOUT page 4 OPERATION page
More informationEMERGENT SOUNDSCAPE COMPOSITION: REFLECTIONS ON VIRTUALITY
EMERGENT SOUNDSCAPE COMPOSITION: REFLECTIONS ON VIRTUALITY by Mark Christopher Brady Bachelor of Science (Honours), University of Cape Town, 1994 THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
More informationSimple Harmonic Motion: What is a Sound Spectrum?
Simple Harmonic Motion: What is a Sound Spectrum? A sound spectrum displays the different frequencies present in a sound. Most sounds are made up of a complicated mixture of vibrations. (There is an introduction
More informationAutomatic Generation of Drum Performance Based on the MIDI Code
Automatic Generation of Drum Performance Based on the MIDI Code Shigeki SUZUKI Mamoru ENDO Masashi YAMADA and Shinya MIYAZAKI Graduate School of Computer and Cognitive Science, Chukyo University 101 tokodachi,
More informationDigital Strobe Tuner. w/ On stage Display
Page 1/7 # Guys EEL 4924 Electrical Engineering Design (Senior Design) Digital Strobe Tuner w/ On stage Display Team Members: Name: David Barnette Email: dtbarn@ufl.edu Phone: 850-217-9147 Name: Jamie
More informationWhite Paper : Achieving synthetic slow-motion in UHDTV. InSync Technology Ltd, UK
White Paper : Achieving synthetic slow-motion in UHDTV InSync Technology Ltd, UK ABSTRACT High speed cameras used for slow motion playback are ubiquitous in sports productions, but their high cost, and
More informationHugo Technology. An introduction into Rob Watts' technology
Hugo Technology An introduction into Rob Watts' technology Copyright Rob Watts 2014 About Rob Watts Audio chip designer both analogue and digital Consultant to silicon chip manufacturers Designer of Chord
More informationDepartment of Electrical & Electronic Engineering Imperial College of Science, Technology and Medicine. Project: Real-Time Speech Enhancement
Department of Electrical & Electronic Engineering Imperial College of Science, Technology and Medicine Project: Real-Time Speech Enhancement Introduction Telephones are increasingly being used in noisy
More informationPSYCHOACOUSTICS & THE GRAMMAR OF AUDIO (By Steve Donofrio NATF)
PSYCHOACOUSTICS & THE GRAMMAR OF AUDIO (By Steve Donofrio NATF) "The reason I got into playing and producing music was its power to travel great distances and have an emotional impact on people" Quincey
More informationAuthors: Kasper Marklund, Anders Friberg, Sofia Dahl, KTH, Carlo Drioli, GEM, Erik Lindström, UUP Last update: November 28, 2002
Groove Machine Authors: Kasper Marklund, Anders Friberg, Sofia Dahl, KTH, Carlo Drioli, GEM, Erik Lindström, UUP Last update: November 28, 2002 1. General information Site: Kulturhuset-The Cultural Centre
More informationMusic Source Separation
Music Source Separation Hao-Wei Tseng Electrical and Engineering System University of Michigan Ann Arbor, Michigan Email: blakesen@umich.edu Abstract In popular music, a cover version or cover song, or
More informationRadio for Everyone...
Radio for Everyone... P R O D U C T I O N O N A I R C O N S O L E Eight dual inputs Built in auto Silence detector 4 USB in/out stereo channels Play out USB control section included AES 3 digital program
More informationTempo and Beat Analysis
Advanced Course Computer Science Music Processing Summer Term 2010 Meinard Müller, Peter Grosche Saarland University and MPI Informatik meinard@mpi-inf.mpg.de Tempo and Beat Analysis Musical Properties:
More informationVoice Controlled Car System
Voice Controlled Car System 6.111 Project Proposal Ekin Karasan & Driss Hafdi November 3, 2016 1. Overview Voice controlled car systems have been very important in providing the ability to drivers to adjust
More informationHEAD. HEAD VISOR (Code 7500ff) Overview. Features. System for online localization of sound sources in real time
HEAD Ebertstraße 30a 52134 Herzogenrath Tel.: +49 2407 577-0 Fax: +49 2407 577-99 email: info@head-acoustics.de Web: www.head-acoustics.de Data Datenblatt Sheet HEAD VISOR (Code 7500ff) System for online
More informationCS229 Project Report Polyphonic Piano Transcription
CS229 Project Report Polyphonic Piano Transcription Mohammad Sadegh Ebrahimi Stanford University Jean-Baptiste Boin Stanford University sadegh@stanford.edu jbboin@stanford.edu 1. Introduction In this project
More informationComputer Coordination With Popular Music: A New Research Agenda 1
Computer Coordination With Popular Music: A New Research Agenda 1 Roger B. Dannenberg roger.dannenberg@cs.cmu.edu http://www.cs.cmu.edu/~rbd School of Computer Science Carnegie Mellon University Pittsburgh,
More informationLinrad On-Screen Controls K1JT
Linrad On-Screen Controls K1JT Main (Startup) Menu A = Weak signal CW B = Normal CW C = Meteor scatter CW D = SSB E = FM F = AM G = QRSS CW H = TX test I = Soundcard test mode J = Analog hardware tune
More informationAPPLICATIONS OF A SEMI-AUTOMATIC MELODY EXTRACTION INTERFACE FOR INDIAN MUSIC
APPLICATIONS OF A SEMI-AUTOMATIC MELODY EXTRACTION INTERFACE FOR INDIAN MUSIC Vishweshwara Rao, Sachin Pant, Madhumita Bhaskar and Preeti Rao Department of Electrical Engineering, IIT Bombay {vishu, sachinp,
More informationMelody Extraction from Generic Audio Clips Thaminda Edirisooriya, Hansohl Kim, Connie Zeng
Melody Extraction from Generic Audio Clips Thaminda Edirisooriya, Hansohl Kim, Connie Zeng Introduction In this project we were interested in extracting the melody from generic audio files. Due to the
More informationHidden melody in music playing motion: Music recording using optical motion tracking system
PROCEEDINGS of the 22 nd International Congress on Acoustics General Musical Acoustics: Paper ICA2016-692 Hidden melody in music playing motion: Music recording using optical motion tracking system Min-Ho
More informationA HIGHLY INTERACTIVE SYSTEM FOR PROCESSING LARGE VOLUMES OF ULTRASONIC TESTING DATA. H. L. Grothues, R. H. Peterson, D. R. Hamlin, K. s.
A HIGHLY INTERACTIVE SYSTEM FOR PROCESSING LARGE VOLUMES OF ULTRASONIC TESTING DATA H. L. Grothues, R. H. Peterson, D. R. Hamlin, K. s. Pickens Southwest Research Institute San Antonio, Texas INTRODUCTION
More informationAdvance Certificate Course In Audio Mixing & Mastering.
Advance Certificate Course In Audio Mixing & Mastering. CODE: SIA-ACMM16 For Whom: Budding Composers/ Music Producers. Assistant Engineers / Producers Working Engineers. Anyone, who has done the basic
More informationMultiband Noise Reduction Component for PurePath Studio Portable Audio Devices
Multiband Noise Reduction Component for PurePath Studio Portable Audio Devices Audio Converters ABSTRACT This application note describes the features, operating procedures and control capabilities of a
More informationPHYSICS OF MUSIC. 1.) Charles Taylor, Exploring Music (Music Library ML3805 T )
REFERENCES: 1.) Charles Taylor, Exploring Music (Music Library ML3805 T225 1992) 2.) Juan Roederer, Physics and Psychophysics of Music (Music Library ML3805 R74 1995) 3.) Physics of Sound, writeup in this
More informationMUSIC TRANSCRIBER. Overall System Description. Alessandro Yamhure 11/04/2005
Roberto Carli 6.111 Project Proposal MUSIC TRANSCRIBER Overall System Description The aim of this digital system is to convert music played into the correct sheet music. We are basically implementing a
More informationDAT335 Music Perception and Cognition Cogswell Polytechnical College Spring Week 6 Class Notes
DAT335 Music Perception and Cognition Cogswell Polytechnical College Spring 2009 Week 6 Class Notes Pitch Perception Introduction Pitch may be described as that attribute of auditory sensation in terms
More informationImplementation of an MPEG Codec on the Tilera TM 64 Processor
1 Implementation of an MPEG Codec on the Tilera TM 64 Processor Whitney Flohr Supervisor: Mark Franklin, Ed Richter Department of Electrical and Systems Engineering Washington University in St. Louis Fall
More informationSREV1 Sampling Guide. An Introduction to Impulse-response Sampling with the SREV1 Sampling Reverberator
An Introduction to Impulse-response Sampling with the SREV Sampling Reverberator Contents Introduction.............................. 2 What is Sound Field Sampling?.....................................
More informationTopic 10. Multi-pitch Analysis
Topic 10 Multi-pitch Analysis What is pitch? Common elements of music are pitch, rhythm, dynamics, and the sonic qualities of timbre and texture. An auditory perceptual attribute in terms of which sounds
More informationUNIVERSITY OF DUBLIN TRINITY COLLEGE
UNIVERSITY OF DUBLIN TRINITY COLLEGE FACULTY OF ENGINEERING & SYSTEMS SCIENCES School of Engineering and SCHOOL OF MUSIC Postgraduate Diploma in Music and Media Technologies Hilary Term 31 st January 2005
More informationStudy of White Gaussian Noise with Varying Signal to Noise Ratio in Speech Signal using Wavelet
American International Journal of Research in Science, Technology, Engineering & Mathematics Available online at http://www.iasir.net ISSN (Print): 2328-3491, ISSN (Online): 2328-3580, ISSN (CD-ROM): 2328-3629
More informationNON-UNIFORM KERNEL SAMPLING IN AUDIO SIGNAL RESAMPLER
NON-UNIFORM KERNEL SAMPLING IN AUDIO SIGNAL RESAMPLER Grzegorz Kraszewski Białystok Technical University, Electrical Engineering Faculty, ul. Wiejska 45D, 15-351 Białystok, Poland, e-mail: krashan@teleinfo.pb.bialystok.pl
More informationInstrument Recognition in Polyphonic Mixtures Using Spectral Envelopes
Instrument Recognition in Polyphonic Mixtures Using Spectral Envelopes hello Jay Biernat Third author University of Rochester University of Rochester Affiliation3 words jbiernat@ur.rochester.edu author3@ismir.edu
More informationA repetition-based framework for lyric alignment in popular songs
A repetition-based framework for lyric alignment in popular songs ABSTRACT LUONG Minh Thang and KAN Min Yen Department of Computer Science, School of Computing, National University of Singapore We examine
More informationTOWARD AN INTELLIGENT EDITOR FOR JAZZ MUSIC
TOWARD AN INTELLIGENT EDITOR FOR JAZZ MUSIC G.TZANETAKIS, N.HU, AND R.B. DANNENBERG Computer Science Department, Carnegie Mellon University 5000 Forbes Avenue, Pittsburgh, PA 15213, USA E-mail: gtzan@cs.cmu.edu
More informationBrowsing News and Talk Video on a Consumer Electronics Platform Using Face Detection
Browsing News and Talk Video on a Consumer Electronics Platform Using Face Detection Kadir A. Peker, Ajay Divakaran, Tom Lanning Mitsubishi Electric Research Laboratories, Cambridge, MA, USA {peker,ajayd,}@merl.com
More informationMeasurement of overtone frequencies of a toy piano and perception of its pitch
Measurement of overtone frequencies of a toy piano and perception of its pitch PACS: 43.75.Mn ABSTRACT Akira Nishimura Department of Media and Cultural Studies, Tokyo University of Information Sciences,
More informationFPFV-285/585 PRODUCTION SOUND Fall 2018 CRITICAL LISTENING Assignment
FPFV-285/585 PRODUCTION SOUND Fall 2018 CRITICAL LISTENING Assignment PREPARATION Track 1) Headphone check -- Left, Right, Left, Right. Track 2) A music excerpt for setting comfortable listening level.
More informationJOURNAL OF BUILDING ACOUSTICS. Volume 20 Number
Early and Late Support Measured over Various Distances: The Covered versus Open Part of the Orchestra Pit by R.H.C. Wenmaekers and C.C.J.M. Hak Reprinted from JOURNAL OF BUILDING ACOUSTICS Volume 2 Number
More informationVISUALIZING AND CONTROLLING SOUND WITH GRAPHICAL INTERFACES
VISUALIZING AND CONTROLLING SOUND WITH GRAPHICAL INTERFACES LIAM O SULLIVAN, FRANK BOLAND Dept. of Electronic & Electrical Engineering, Trinity College Dublin, Dublin 2, Ireland lmosulli@tcd.ie Developments
More informationToward a Computationally-Enhanced Acoustic Grand Piano
Toward a Computationally-Enhanced Acoustic Grand Piano Andrew McPherson Electrical & Computer Engineering Drexel University 3141 Chestnut St. Philadelphia, PA 19104 USA apm@drexel.edu Youngmoo Kim Electrical
More informationGuitar and Rock/Blues Vocalists
Addendum A, Page 1 to: Guitar and Rock/Blues Vocalists Guitar players and Rock/Blues vocalists share a similar part of the stage and as such, are similarly exposed to loud music. Some of the strategies
More informationFFT Laboratory Experiments for the HP Series Oscilloscopes and HP 54657A/54658A Measurement Storage Modules
FFT Laboratory Experiments for the HP 54600 Series Oscilloscopes and HP 54657A/54658A Measurement Storage Modules By: Michael W. Thompson, PhD. EE Dept. of Electrical Engineering Colorado State University
More informationDETECTING ENVIRONMENTAL NOISE WITH BASIC TOOLS
DETECTING ENVIRONMENTAL NOISE WITH BASIC TOOLS By Henrik, September 2018, Version 2 Measuring low-frequency components of environmental noise close to the hearing threshold with high accuracy requires
More informationPS User Guide Series Seismic-Data Display
PS User Guide Series 2015 Seismic-Data Display Prepared By Choon B. Park, Ph.D. January 2015 Table of Contents Page 1. File 2 2. Data 2 2.1 Resample 3 3. Edit 4 3.1 Export Data 4 3.2 Cut/Append Records
More informationCHAPTER 3 AUDIO MIXER DIGITAL AUDIO PRODUCTION [IP3038PA]
CHAPTER 3 AUDIO MIXER DIGITAL AUDIO PRODUCTION [IP3038PA] Learning Objectives By the end of this chapter, students should be able to: 1 State the function of the audio mixer in the sound studio. 2 Explain
More informationMULTIMIX 8/4 DIGITAL AUDIO-PROCESSING
MULTIMIX 8/4 DIGITAL AUDIO-PROCESSING Designed and Manufactured by ITEC Tontechnik und Industrieelektronik GesmbH 8200 Laßnitzthal 300 Austria / Europe MULTIMIX 8/4 DIGITAL Aim The most important aim of
More informationLiquid Mix Plug-in. User Guide FA
Liquid Mix Plug-in User Guide FA0000-01 1 1. COMPRESSOR SECTION... 3 INPUT LEVEL...3 COMPRESSOR EMULATION SELECT...3 COMPRESSOR ON...3 THRESHOLD...3 RATIO...4 COMPRESSOR GRAPH...4 GAIN REDUCTION METER...5
More informationVIDEO JUDGE SYSTEM SETUP & CAPTURE
VIDEO JUDGE SYSTEM SETUP & CAPTURE TABLE OF CONTENTS GENERAL OVERVIEW... 1 ABOUT THE COMPETITIONS... 1 PRIOR TO THE EVENT... 2 EQUIPMENT LIST... 2 ARRIVAL AT THE VENUE... 3 EQUIPMENT SETUP... 4 Camera
More informationVCE VET MUSIC INDUSTRY: SOUND PRODUCTION
Victorian Certificate of Education 2017 SUPERVISOR TO ATTACH PROCESSING LABEL HERE Letter STUDENT NUMBER VCE VET MUSIC INDUSTRY: SOUND PRODUCTION Aural and written examination Friday 17 November 2017 Reading
More informationConvention Paper Presented at the 139th Convention 2015 October 29 November 1 New York, USA
Audio Engineering Society Convention Paper Presented at the 139th Convention 215 October 29 November 1 New York, USA This Convention paper was selected based on a submitted abstract and 75-word precis
More informationAcoustic Instrument Message Specification
Acoustic Instrument Message Specification v 0.4 Proposal June 15, 2014 Keith McMillen Instruments BEAM Foundation Created by: Keith McMillen - keith@beamfoundation.org With contributions from : Barry Threw
More informationAphro-V1 Digital reverb & fx processor..
Aphro-V1 Digital reverb & fx processor.. Copyright all rights reserved 1998, 1999. Audio Mechanic & Sound Breeder page 1 Summary Specifications p 3 Introduction p 4 Main Interface p 5 LCD Display p 5 Interfaces
More informationVirtual Vibration Analyzer
Virtual Vibration Analyzer Vibration/industrial systems LabVIEW DAQ by Ricardo Jaramillo, Manager, Ricardo Jaramillo y Cía; Daniel Jaramillo, Engineering Assistant, Ricardo Jaramillo y Cía The Challenge:
More informationMajor Differences Between the DT9847 Series Modules
DT9847 Series Dynamic Signal Analyzer for USB With Low THD and Wide Dynamic Range The DT9847 Series are high-accuracy, dynamic signal acquisition modules designed for sound and vibration applications.
More informationMusic Complexity Descriptors. Matt Stabile June 6 th, 2008
Music Complexity Descriptors Matt Stabile June 6 th, 2008 Musical Complexity as a Semantic Descriptor Modern digital audio collections need new criteria for categorization and searching. Applicable to:
More informationPITZ Introduction to the Video System
PITZ Introduction to the Video System Stefan Weiße DESY Zeuthen June 10, 2003 Agenda 1. Introduction to PITZ 2. Why a video system? 3. Schematic structure 4. Client/Server architecture 5. Hardware 6. Software
More informationPiotr KLECZKOWSKI, Magdalena PLEWA, Grzegorz PYDA
ARCHIVES OF ACOUSTICS 33, 4 (Supplement), 147 152 (2008) LOCALIZATION OF A SOUND SOURCE IN DOUBLE MS RECORDINGS Piotr KLECZKOWSKI, Magdalena PLEWA, Grzegorz PYDA AGH University od Science and Technology
More information