Convention Paper Presented at the 124th Convention 2008 May Amsterdam, The Netherlands


Audio Engineering Society
Convention Paper
Presented at the 124th Convention, 2008 May, Amsterdam, The Netherlands

The papers at this Convention have been selected on the basis of a submitted abstract and extended precis that have been peer reviewed by at least two qualified anonymous reviewers. This convention paper has been reproduced from the author's advance manuscript, without editing, corrections, or consideration by the Review Board. The AES takes no responsibility for the contents. Additional papers may be obtained by sending request and remittance to Audio Engineering Society, East nd Street, New York, New York 5-5, USA. All rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering Society.

Versus Multichannel Recording

Gavin Kearney (Trinity College Dublin, Ireland) and Jeff Levison (Euphonix, Inc., Palo Alto, California, USA)

Correspondence should be addressed to Gavin Kearney (gpkearney@ee.tcd.ie)

ABSTRACT

We present a comparison of live recordings of a choral ensemble versus dry recordings of the same players, with the acoustic environment reconstructed from impulse responses of the original reverberant performance space. Binaural measurements are used to objectively classify the recordings, and the perceptual attributes are investigated through a series of subjective listening tests. It is shown that the differences between dry recordings convolved with linear time-invariant (LTI) impulse responses and actual acoustical recordings can be perceived by a panel of expert listeners.

1. INTRODUCTION

Natural acoustic recording presents significant challenges for both recording and mix engineers. Quite often, recording in an ideal reverberant space is not practical due to low noise isolation, microphone configuration, equipment and performance issues, or simply budgetary limitations.
In other cases, musicians perform overdubs in a dry acoustic space, and this audio is mixed with naturally reverberant recordings. For all these situations, a common tool used by audio engineers is convolution reverberation, where a dry audio signal is convolved with impulse responses taken from an ideal reverberant environment with the aim of auditory scene synthesis.

One important, and often overlooked, practical consideration of such audio processing is the colouration of the impulse response caused by spectral and temporal differences at each stage of the convolution chain. Loudspeaker and microphone responses and their directivity functions, as well as proximity effects and the test stimulus used, all contribute to changing the impulse response from a true representation of the acoustic space. Also, minute temporal fluctuations due to changes in atmospheric pressure can lead to inaccurate representation of the acoustic space in the impulse response [].

Another significant factor is that the representation of room responses in this manner assumes that the source-room interaction is stationary and linear time-invariant (LTI). However, the impulse response changes significantly with changes in the spatial position between the source and the listener. Small movements of musicians can lead to changes in timbre at the listening position []. Vocalists are a particularly good example of this, since the excitation of the room is directly influenced by the directional orientation of the singer. Added to this are density changes in the air from movement, temperature shifts, and other diffraction and propagation effects, all of which combine to produce constant variations of the acoustical response [].

Finally, although the perceptual changes of reverberant recordings over different loudspeaker layouts have been well documented, to the authors' knowledge the perceptual differences between recordings made with convolution reverberation and natural recordings over different loudspeaker layouts have not. In this regard, current multichannel recording schemes have not been fully documented as impulse recording mechanisms. Kessler [3] details the use of the INA-5 Spider microphone configuration for obtaining 5 channel impulse response measurements. Farina [] also suggests a combinational method of using a binaural head, an Ambisonics Soundfield microphone and an ORTF pair for full periphonic impulse response capture. However, since the advent of commercially available 7.1 content through Blu-ray discs and evolving download standards such as FLAC (Free Lossless Audio Codec), other multichannel microphone configurations which facilitate 7.1 recording may be equally applicable to impulse response capture for such higher order surround formats. Of express interest are the microphone configurations useful for 7.1
periphonic (with height) playback formats [5].

In this paper we investigate these configurations as well as the perceptual issues of colouration and distortion in the convolution chain for synthesis of natural acoustic recordings in 7.1 reproduction. In particular we investigate:

1. The perceptual differences between natural acoustic recordings and dry recordings convolved with multichannel impulse responses.
2. The subjective and objective differences apparent between compensated and uncompensated convolution reverberation.
3. The subjective attributes associated with the different impulse recording mechanisms.

These investigations are implemented through comparison of naturally reverberant acoustic recordings of a choral ensemble with dry recordings convolved with impulse responses taken from the original performance space. Objective comparison of the recording methods through binaural measurements is followed by listening tests to investigate the perceptual attributes of the recordings. The analysis begins with a succinct review of auditory convolution methods and measurements.

2. CONVOLUTION AND DECONVOLUTION IN MULTICHANNEL AUDIO

A typical time domain convolution measurement chain is given in Figure 1. An excitation signal s(t) is played through a loudspeaker in the acoustic environment and subjected to the temporal and spectral effects of the room. The signal is recorded and s(t) is then deconvolved out of the recorded signal r(t) to obtain the LTI response ĥ(t).

Fig. 1: Typical convolution chain: test stimulus s(t), loudspeaker response g_l(t), acoustic response h(t), microphone response g_m(t), recorded signal r(t), deconvolution filter d(t), estimated impulse response ĥ(t).

This is the typical convolution chain that pertains to most common impulse response recording situations. However, in order to obtain a more accurate estimation of the room response, we must compensate for the responses of both the loudspeaker and the microphone.
This can be achieved in the deconvolution filter d(t) by computing the inverse of s(t), the loudspeaker response g_l(t) and the microphone response g_m(t) by

$$D(\omega) = S^{-1}(\omega)\, G^{-1}(\omega) \qquad (1)$$

where D(ω) and S(ω) are the Fourier domain representations of the deconvolution filter and excitation signal respectively, and

$$G(\omega) = \mathcal{F}\left[g_l(t) * g_m(t)\right] \qquad (2)$$

The estimate of the room response ĥ(t) can then be obtained by

$$\hat{h}(t) = \mathcal{F}^{-1}\left[R(\omega)\, D(\omega)\right] \qquad (3)$$

where R(ω) is the Fourier domain representation of the recorded signal r(t).

The original excitation signal is of critical importance. Maximum Length Sequences (MLS) are popular choices because their autocorrelation is a Kronecker delta function []. However, it is generally accepted that the use of a TSP (Time Stretched Pulse) provides several advantages over the MLS technique, namely that it is more robust to time variances and non-linearities of the room [, 3]. A TSP is an exponential sine sweep that, when played slowly enough, causes all loudspeaker-induced distortion to appear as pre-delayed signals at the start of the impulse response. The length of the sweep must be greater than the room's reverberation time multiplied by the number of octaves (taken as ).

Given the ideal characterisation of room responses in the aforementioned manner, it is prudent to ask whether compensation for distortions from the loudspeaker and microphone responses is a major perceptual feature, given that, currently, the majority of impulse responses in recording productions are taken and applied without such compensation. This is, of course, a function of the quality of the loudspeaker and microphones used, but the extent to which this may or may not be perceived given reasonable commercially available recording equipment has not yet been investigated.

A further aspect of multichannel impulse response measurements is that if the amplitude and arrival times of the impulse are preserved (i.e. no normalization), the source-receiver angles are fixed for each measurement.
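The compensated and uncompensated deconvolution described above can be illustrated with a short NumPy sketch. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`min_sweep_length`, `exp_sweep`, `deconvolve`) are hypothetical, the sweep follows the commonly used exponential formulation, and the small constant `eps` is an added regularisation term (not part of the text) that avoids dividing by near-zero frequency bins.

```python
import numpy as np

def min_sweep_length(rt_s, f1, f2):
    """Lower bound on sweep duration: reverberation time times octaves swept."""
    return rt_s * np.log2(f2 / f1)

def exp_sweep(f1, f2, duration, fs):
    """Exponential sine sweep from f1 to f2 Hz lasting `duration` seconds."""
    t = np.arange(int(round(duration * fs))) / fs
    k = np.log(f2 / f1)
    return np.sin(2 * np.pi * f1 * duration / k * (np.exp(t * k / duration) - 1))

def deconvolve(recorded, stimulus, g_chain=None, eps=1e-8):
    """Estimate the room response as F^-1[R(w) D(w)].

    g_chain is the combined loudspeaker/microphone impulse response
    (g_l convolved with g_m). Passing None gives the uncompensated case,
    where D(w) inverts the stimulus only.
    eps is an assumed regularisation constant, not taken from the paper.
    """
    n = len(recorded)                      # recorded is at least as long as stimulus
    R = np.fft.rfft(recorded, n)
    SG = np.fft.rfft(stimulus, n)          # S(w)
    if g_chain is not None:
        SG = SG * np.fft.rfft(g_chain, n)  # S(w) G(w)
    D = np.conj(SG) / (np.abs(SG) ** 2 + eps)  # regularised inverse of S(w) G(w)
    return np.fft.irfft(R * D, n)
```

With `g_chain=None` the estimate still contains the loudspeaker and microphone responses, i.e. it approximates h(t) convolved with g_l(t) and g_m(t); this is the colouration that the compensated filter is meant to remove.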
Thus, if the wish is to construct equivalent recordings from both live and dry acoustic environments, impulse measurements must be taken at each of the performers' positions in the live room using the same microphone array. The choice of such recording arrays will now be considered.

3. MICROPHONE ARRAYS FOR MULTICHANNEL IMPULSE CAPTURE

The microphone array chosen for 7 channel impulse response capture should attempt to satisfy the following criteria under normal recording situations:

1. It should provide excellent frontal localisation.
2. It should provide adequate lateral reflections for sensations of envelopment and spaciousness.
3. It should maintain good downwards compatibility for lower order playback layouts.

In order to successfully meet these criteria, the choice of loudspeaker layout must also be considered. This paper will concentrate on the Front-High layout as proposed by DTS for Blu-ray discs, and shown in Figure 2 [5]. This is a standard ITU 5.1 layout with two elevated loudspeakers directly above the L and R front loudspeakers, typically at heights of 3 above the main loudspeaker array. Since this is a 7 channel approach derived from the 5 channel version, the choice of multichannel microphone array can be an augmented version of existing 5 channel recording approaches.

Fig. 2: 7.1 Front High loudspeaker layout (LS, L, LH, C, RH, R, RS; LH and RH elevated).

The use of the 5-channel Williams MMA microphone configuration has been shown by Kessler [3] to provide a good general approach to impulse response recording and is based on the well known Williams Curves for recording angle calculation [7]. In Kessler's approach, the array is used with equal arm length and constant angles. However, since the interest is in the virtual imaging capabilities for sources with known distances from the array, the related Optimized Cardioid Triangle approach was employed, as developed by Thiele and Wittek [8]. This has been shown to provide excellent localisation accuracy by reducing interchannel crosstalk whilst allowing the desired recording angles to be achieved. The array can be extended to a 7 channel version by incorporating rear facing cardioids and upwards facing supercardioids as shown in Figure 3.

Fig. 3: Optimized Cardioid Triangle 7 microphone configuration.

Another approach is the combination of the popular OCT-3 array with a Hamasaki Square for 7 channel reproduction. Here, the Hamasaki Square can be positioned at a diffuse point in the room (in relation to the source position) and, using appropriate spatial design, can be adjusted to provide satisfying lateral enhancement to the rear playback channels, as well as height diffusion for the front-high channels if desired.

Fig. 4: OCT-3 and Hamasaki Square microphone configuration.

Finally, Ambisonics is a well known system for recording and recreating an entire soundfield, and excellent overviews of the system can be found in [9, ]. B-Format Ambisonics offers the ability to capture directional information via a Soundfield microphone in the X, Y and Z directions and is therefore an elegant method for recording impulse responses for use with height systems. Decoding the spherical harmonic components for a given loudspeaker layout can be achieved using hardware or software processing, and for the work presented in this paper is implemented by the VVMic software of David McGriffy [].
This is a free software utility for deriving the virtual microphone signals from B-Format files.

Of importance for all these multichannel configurations is their placement within the direct-reverberant field. For good frontal imaging to be achieved, whilst maintaining effective lateral reverberation, the microphone should be placed at or around the critical distance. This is the distance from the source at which the direct and reverberant energies are equal, and is given by

$$D_{crit} = 0.057\sqrt{\frac{V}{RT}} \qquad (4)$$

where V is the volume of the room in m³ and RT is the reverberation time of the acoustic space in seconds [8].

It should be noted that in regular artistic practice the authors agree that the ultimate, final adjustments for microphone placement are made by careful listening and repositioning of the microphones or players based on a subjective approach. That being said, one of the principal goals of this paper is to allow the experiment to be recreated with reasonable accuracy so that the data can be expanded upon and verified. For this reason, and to the best of the authors' abilities, only objective measurements were used to decide the microphone positions.
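As a worked illustration of the critical distance, the sketch below assumes the common form D_crit = 0.057 √(V/RT), with V in m³ and RT in seconds. The function name and the example values are hypothetical and are not the chapel's measured figures.

```python
import math

def critical_distance(volume_m3, rt_s):
    """Critical distance in metres, assuming D_crit = 0.057 * sqrt(V / RT)."""
    if volume_m3 <= 0 or rt_s <= 0:
        raise ValueError("volume and reverberation time must be positive")
    return 0.057 * math.sqrt(volume_m3 / rt_s)

# Hypothetical room: 400 m^3 with a 4 s reverberation time.
# sqrt(400 / 4) = 10, so the critical distance is 0.057 * 10 = 0.57 m.
example = critical_distance(400.0, 4.0)
```

Because the distance scales with the square root of V/RT, a longer reverberation time in the same room pulls the critical distance closer to the source.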


4. ACTUAL VS. VIRTUAL RECORDING STRATEGY

In order to meet the objectives outlined in Section 1, a formal strategy for both objective and subjective assessment of actual vs. virtual multichannel recording was implemented. This was based on the recording of a small choral ensemble in a highly reverberant environment. Natural room recordings were directly compared to close microphone recordings of the same performance, convolved with impulses made with the multichannel microphones. This method allows for direct comparison of the perceived audio without variation in the artistic performance.

4.1. Recording Setup

The recordings of the resident Chapel Choir were made at the chapel in Trinity College Dublin, Ireland. This chapel is a highly reverberant and diffuse performance environment with a volume of 5m 3. The spatially averaged reverberation time RT of the room was measured at 8 different points in the room (3 stage/altar source positions, 5 audience receiver positions and 3 stage/altar receiver positions) using the Schroeder inverse integral method [] and is shown in Figure 5. The room measures a spatially averaged RT of . seconds at kHz.

Fig. 5: Spatially averaged RT of Trinity College Chapel (RT against frequency in Hz).

The chapel, as with most non-studio recording environments, does not have ideal noise isolation from outside traffic and other street noises. Measurement of the LAeq levels showed a dBA noise floor inside the church.

The main microphones used to capture the performance were an OCT-7 array (consisting of Schoeps CM- cardioids and CCM-V supercardioids) and a Soundfield MK5, with the microphones placed at the critical distance of .m from the centre singer. A Hamasaki Square comprised of AKG C- microphones was placed high up and at the back of the room at a total distance of 33m away from the main microphones, as shown in Figures 6 and 7.

The choral ensemble consisted of seven singers, performing in front of the altar.
Close microphones (Rode NT5s) were used to obtain as dry a version of the performance as possible from each of the singers.

Fig. 6: Microphone setups: Top - Choir close microphones, Middle - OCT-7 and Soundfield MK5, Bottom - Hamasaki Square.

The stage/altar performance space measured 3.9m in width, giving a resultant recording angle of 7 from the critical distance. Dimensions for the OCT array were calculated using the Image Assistant tool developed by Helmut Wittek [3]. This tool is based on interpretation of Williams curves [7] for desired recording angles and microphone directivity, and gives the localisation for a given loudspeaker layout and listener position. The resultant frontal LCR curves are shown in Figure 8.

Fig. 7: Microphone layout over chapel (spot microphones, OCT-7 + Soundfield MK5, Hamasaki Square).

Fig. 8: Resultant OCT localisation curves based on a 7 recording angle at the critical distance.

Recordings were made of several choral pieces with all 7 singers, as well as a solo recording with one singer located at position 3 (centre position in front of the array). The choral pieces recorded were:

Sicut Cervus - Giovanni Pierluigi da Palestrina
Sanctus - Thomas Tallis

4.2. Impulse Response Capture

After the performances, impulse responses were captured using the TSP technique with 3 second long sine sweeps. A Genelec 9A was used for playback of the sweep tones, and measurements were taken for the source at each of the singers' positions. Care was taken to ensure that both the height and position of the source matched that of the corresponding singer. For each measurement, recordings were taken using the OCT, Soundfield and Hamasaki arrays. From these measurements a data set of 5 deconvolved impulse responses was obtained. For comparative purposes the recorded impulses were compensated for loudspeaker response only and not for colouration of the microphones. However, the recorded spot microphone signals, which would later be used as the feeds to the convolution, were compensated for the on-axis frequency response of the Rode NT5.

5. INITIAL ANALYSIS

An initial study was implemented for 7.1 playback to gauge the perceptual effect of colouration due to the convolution chain for a single source. The recordings made using the solo singer were used for this purpose. The test was designed to compare the natural array recordings against the spot microphone recording convolved with the captured impulse responses for:

1. The OCT-7 natural and impulse recordings,
2. The B-Format natural and impulse recordings,
3. The combination of the OCT-3 and Hamasaki Square natural and impulse recordings.

Of particular interest here is the perceptual difference between impulse responses with and without the colouration of the convolution chain. Thus, in the following, the term compensated refers to impulse responses made through application of the inverse transforms of the Genelec 9A and NT5 microphone directional responses (measured in a free field). Proximity effect was also considered in this pre-processing, and a 5Hz -3dB per octave shelf filter was found to give the most satisfactory compensation.
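The virtual recordings compared in this study are dry spot microphone signals convolved with the multichannel impulse responses captured at the matching singer positions. A minimal sketch of that synthesis step is given below. The function name and array shapes are assumptions made for illustration, and no normalisation is applied so that the relative amplitudes and arrival times in the impulse responses are preserved.

```python
import numpy as np

def virtual_recording(spots, irs):
    """Sum each singer's dry signal convolved with that position's impulse responses.

    spots: list of 1-D arrays, one dry (spot microphone) signal per singer.
    irs:   list of 2-D arrays of shape (channels, ir_length), one per singer
           position, all with the same number of channels.
    Returns an array of shape (channels, output_length).
    """
    n_ch = irs[0].shape[0]
    n_out = max(len(s) + ir.shape[1] - 1 for s, ir in zip(spots, irs))
    out = np.zeros((n_ch, n_out))
    for s, ir in zip(spots, irs):
        for ch in range(n_ch):
            y = np.convolve(s, ir[ch])   # direct convolution, no gain normalisation
            out[ch, :len(y)] += y
    return out
```

For impulse responses several seconds long, an FFT-based routine such as `scipy.signal.fftconvolve` would be considerably faster than `np.convolve`; the direct form is used here only to keep the sketch dependency-free.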

5.1. Subjective Testing of Colouration Effects for a single source

In total, for this test three entire sets of mixes were made for each set of impulse responses. These were mixes of a solo vocalist made from:

1. The actual room recordings from the multichannel arrays in the chapel.
2. Virtual recordings, created from uncompensated impulse responses from the multichannel arrays.
3. Virtual recordings, created from compensated impulse responses from the multichannel arrays.

The recordings were played back in a treated listening room over a 7.1 Front-High layout comprising 7 Genelec 9A loudspeakers and a Genelec 75A subwoofer. Each of the loudspeakers was calibrated to 79dBC (with +dB in-band gain on the subwoofer) at the centre listening position. The listening room is a good monitoring environment with a spatially averaged reverberation time of .3 seconds at kHz.

In accordance with the ITU-R BS.8- recommendation for listening tests, expert listeners were chosen for the tests []. Each listener was under 35 years of age, of excellent hearing, and well experienced in musical production. The tests were of a paired comparison type and the sequence in which each audio sample pair was played was random. Background noise at the same signal to noise ratio measured in the hall was introduced to each of the virtual recordings to remove any bias towards the natural recordings in the tests. The participants were given full control of the test and were allowed to play back and compare each sample pair at their leisure.

Each participant was first asked to rate the similarity between the recordings on a scale of 1 to 5, where 1 is completely different and 5 is identical.
Each answer z_i was normalised according to

$$z_i = \frac{x_i - x_{si}}{s_{si}}\, s_s + x_s \qquad (5)$$

where x_i is the score of subject i, x_si is the mean score for subject i in session s, x_s is the mean score for all subjects in session s, s_s is the standard deviation for all subjects in session s, and s_si is the standard deviation for subject i in session s [].

The results presented in Figure 9 show the extent of the perceptual difference between the natural and uncompensated virtual recordings, with the greatest difference and largest deviation from the mean µ occurring for Ambisonics. In contrast, with the compensated recordings, the distinction between the actual and virtual recordings becomes quite unclear, and higher values of µ are obtained in all cases.

Fig. 9: Perceptual difference between audio sample pairs for OCT-7, Ambisonics and OCT-3 + Hamasaki (1 = Completely Different, 2 = Dissimilar, 3 = Similar, 4 = Very Close, 5 = Identical). Black = µ (Uncompensated), Red = µ (Compensated); the standard deviation is also shown.

Each listener was then asked to identify which was the natural recording and which was the virtual uncompensated one. The results, shown in Figure 10, indicate that all listeners could identify the actual Ambisonics recording, with 7% identifying the actual OCT-3-Hamasaki recordings and 8% the natural OCT-7 recording. However, when the actual recordings were compared against the virtual compensated recordings, each listener found it quite difficult to distinguish between the two, in particular for the OCT-3 + Hamasaki Square.

It can be concluded from these results that the differences in uncompensated virtual multichannel recordings are highly perceptible to a panel of expert listeners, whereas the differences in the compensated versions are much less so. It also appears that diffuse reflections play an important role in blurring this distinction.
The results obtained for the OCT-3 and Hamasaki Square demonstrate this, since the role of the Hamasaki Square in this case is to introduce more diffuse reflections to the mixes.
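The score normalisation used for the similarity ratings in this section can be sketched in a few lines of NumPy. This is an illustrative reading of the formula rather than the authors' code: the function name is hypothetical, and the use of the sample standard deviation (`ddof=1`) is an assumption, since the text does not say which estimator was used.

```python
import numpy as np

def normalise_scores(x):
    """Normalise raw scores for one session.

    x: array of shape (subjects, items) holding the raw scores x_i.
    Each subject's scores are shifted and scaled to the session-wide
    mean x_s and standard deviation s_s:
        z = (x - x_si) / s_si * s_s + x_s
    """
    x = np.asarray(x, dtype=float)
    x_si = x.mean(axis=1, keepdims=True)          # per-subject mean
    s_si = x.std(axis=1, ddof=1, keepdims=True)   # per-subject std (assumed sample std)
    x_s = x.mean()                                # session mean over all subjects
    s_s = x.std(ddof=1)                           # session std over all subjects
    return (x - x_si) / s_si * s_s + x_s
```

A subject who gives the same score to every item has s_si = 0 and cannot be normalised this way; such cases would need to be handled separately.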

Fig. 10: % identification of actual recording against uncompensated (black) and compensated (red) virtual recordings for OCT-7, Ambisonics and OCT-3 + Hamasaki Square.

5.2. Objective Testing of Colouration Effects for a single source

When the listeners were informally asked what they felt was the main cause of the differences between recordings, they cited the difference in the direct sound level as a major factor, as well as a sense of the virtual recordings being more controlled. In objectively considering these differences in the uncompensated recordings, it is first obvious that the measurement loudspeaker has different directional properties to those of the human voice, thereby significantly changing the direct to reverberant ratio in the impulse response. Secondly, the spot microphone recordings have a greater bass content due to the proximity effect at the microphone. Finally, the spectral distortions introduced by the spot microphone and measurement loudspeaker appear as audible changes in timbre.

The objective differences in these recordings can first be investigated by examining the signals received at the ear. To achieve this, a Neumann KU binaural mannequin was situated at the central sweet spot in the listening array and binaural recordings were taken of the playback. The recordings were then compared using the normalised Interaural Cross Correlation Function (IACF). The IACF is a function in the range [-1, 1] which gives a measure of the correlation between the received signals within the integration limits t_1 to t_2 as a function of the time delay τ:

$$\mathrm{IACF}(\tau) = \frac{\int_{t_1}^{t_2} x_L(t)\, x_R(t+\tau)\, dt}{\sqrt{\int_{t_1}^{t_2} x_L^2(t)\, dt \int_{t_1}^{t_2} x_R^2(t)\, dt}} \qquad (6)$$

where x_L(t) and x_R(t) are the left and right ear signals. The maximum of this function is known as the Interaural Cross Correlation Coefficient (IACC), and is commonly used as a measure of the acoustic quality in concert halls [5]. In general, the higher the value of the IACC, the narrower the perceived source width.
Okano et al. have shown the relationship between the IACC and apparent source width as ASW = 1 - IACC and have demonstrated its usefulness in measuring the acoustic quality of concert halls []. They show typical ASW values of .5 to .7 for concert hall acoustics. Furthermore, the time delay at which the IACF is maximum is representative of the position of the source, although as more reflections are added, the main lobe of the function becomes less well defined. Conventionally, this measure is implemented on sets of binaural impulse responses but, as suggested by Mason [7], it can also be used as a continuous measure over the length of the musical progression.

For each of the recordings the IACC was taken every .5ms and the results plotted in Figure 11. Since the source width fluctuates rapidly in nonstationary audio, a response averaged over a 35 ms window is presented here. It can be seen that the results are very close in terms of the magnitude of the IACC, but there are differences between the actual and virtual recordings which lead to subtle changes in the perceived Apparent Source Width (ASW). It was also found that the correlation of the compensated recordings to the actual recordings changes mostly in the OCT case, with the average IACC response remaining highly similar for the other two. However, small IACC fluctuations were still found to exist in all sequences. For the virtual recordings such changes in the IACC measurements can be partly attributed to the averaging function, since the apparent source width will tend to zero between musical phrases. Regardless, the actual and virtual measurements are non-identical, and other influential factors, such as the changing directional properties of the performer vs. the static directional properties of the measurement loudspeaker, lead to temporal fluctuations in the measured ASW. Such small changes in the ASW indicate fluctuations in the direct source and early reflections at the receiver points.
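The IACF and IACC measures discussed above can be sketched for one block of a binaural recording as follows. The function names are hypothetical, and the ±1 ms lag range is an assumption borrowed from common room-acoustics practice rather than a value stated in the text. The quantity 1 - IACC can then be tracked block by block as a source-width indicator.

```python
import numpy as np

def iacf(x_l, x_r, fs, max_lag_ms=1.0):
    """Normalised interaural cross-correlation over lags of +/- max_lag_ms.

    Returns (taus, values): lag times in seconds and the correlation at each lag,
    computed as sum_t x_l[t] * x_r[t + lag], normalised by the signal energies.
    """
    x_l = np.asarray(x_l, dtype=float)
    x_r = np.asarray(x_r, dtype=float)
    n = len(x_l)
    max_lag = int(round(max_lag_ms * 1e-3 * fs))
    norm = np.sqrt(np.sum(x_l ** 2) * np.sum(x_r ** 2))
    lags = np.arange(-max_lag, max_lag + 1)
    vals = np.empty(len(lags))
    for i, k in enumerate(lags):
        if k >= 0:
            vals[i] = np.dot(x_l[:n - k], x_r[k:])    # right channel delayed by k
        else:
            vals[i] = np.dot(x_l[-k:], x_r[:n + k])   # left channel delayed by |k|
    return lags / fs, vals / norm

def iacc(x_l, x_r, fs, max_lag_ms=1.0):
    """Maximum of the IACF and the lag (in seconds) at which it occurs."""
    taus, vals = iacf(x_l, x_r, fs, max_lag_ms)
    i = int(np.argmax(vals))
    return vals[i], taus[i]
```

A silent block has zero energy and yields an undefined (NaN) result, so gaps between musical phrases need separate handling before any averaging.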
It is also useful to objectively look at the level differences between each binaural pair. This will indicate the directional characteristic of the source. Measurement of the Interaural Level Difference (ILD) was accomplished by applying A-weighted octave band filtering to the binaural signals. The ILD was then calculated from the power spectrum ratio

$$\mathrm{ILD} = 10\log_{10}\frac{F(s_l(t))}{F(s_r(t))}\ \mathrm{dB} \qquad (7)$$

where F(s_l(t)) and F(s_r(t)) are the A-weighted Fourier domain representations of the left and right ear signals respectively.

Fig. 11: Apparent Source Width (1 - IACC against time in seconds) over second of the audio segments for OCT-7, Ambisonics and OCT-3 + Hamasaki Square (actual recording, virtual recording uncompensated, virtual recording compensated).

Figure 12 shows the ILDs for the actual and virtual uncompensated recordings. Also presented here is the magnitude of the ILD error in dB between the actual and virtual recordings,

$$\mathrm{ILD}_E = \left|\mathrm{ILD}_A - \mathrm{ILD}_V\right| \qquad (8)$$

It can be seen in all three cases that there is an average error between the recordings of approximately .8dB across the octave bands.

Fig. 12: Averaged A-weighted Interaural Level Difference (dBA against octave bands in Hz) of actual (ILD_A) vs. uncompensated virtual (ILD_V) recordings for OCT-7, Ambisonics and OCT-3 + Hamasaki ( Sec Average).

Figure 13 shows the ILDs for the actual and compensated virtual recordings. Again, there is a close correlation between the data sets, but it is noted that ILD_E is reduced, indicating that the source imaging of the virtual recordings becomes closer to that of the actual recordings when compensation is applied. However, in all three cases an error still exists between the compensated virtual and actual ILD estimates. Since this examination investigates a source at a position of 0° azimuth and 0° elevation, it is noted that the errors in the level differences across the octave bands are indeed significant, of the order of to dB maximum.
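The ILD and ILD error measures above can be sketched per octave band as follows. This is a simplified illustration, not the authors' procedure: the function names and the set of band centres are assumptions, band powers are taken directly from FFT bins, and the A-weighting step described in the text is omitted for brevity.

```python
import numpy as np

OCTAVE_CENTRES = (125, 250, 500, 1000, 2000, 4000, 8000)  # Hz, assumed band set

def band_powers(x, fs, centres=OCTAVE_CENTRES):
    """Sum the power spectrum of x within each octave band (no A-weighting)."""
    spec = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    powers = []
    for fc in centres:
        lo, hi = fc / np.sqrt(2), fc * np.sqrt(2)   # octave band edges
        powers.append(spec[(freqs >= lo) & (freqs < hi)].sum())
    return np.array(powers)

def ild_db(left, right, fs, centres=OCTAVE_CENTRES):
    """Interaural level difference in dB per band: 10*log10(P_left / P_right)."""
    return 10 * np.log10(band_powers(left, fs, centres) / band_powers(right, fs, centres))

def ild_error(ild_actual, ild_virtual):
    """Magnitude of the ILD error between actual and virtual recordings."""
    return np.abs(np.asarray(ild_actual) - np.asarray(ild_virtual))
```

For a source directly in front of the listener the ILD in each band would ideally be close to 0 dB, which is why even small per-band errors are informative here. Bands in which the right ear signal has no energy would give an undefined ratio and need guarding in practice.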
Since the source in the virtual recording is considered completely static, changes in the ILD can be attributed to the reverberant field produced. However, the existence of ILD_E indicates another influential factor causing spectral fluctuations in the received binaural signals. It should be noted that equivalent Interaural Time Difference errors ITD_E were also found, again indicating small differences in source position.

Fig. 13: Averaged A-weighted Interaural Level Difference (dBA against octave bands in Hz) of actual (ILD_A) vs. compensated virtual (ILD_V) recordings for OCT-7, Ambisonics and OCT-3 + Hamasaki ( Sec Average).

It can be concluded from this series of tests that small fluctuations in the IACC and ILD in the binaural spectrum contribute to perceived changes in the convolved source. Compensation for the colouration of the convolution chain reduces ILD_E. However, errors still exist after compensation, which indicates that other factors, such as the changing directional response of the performer, are responsible for both ILD_E and changes in the ASW, as well as for perceptual differences between the actual and virtual recordings.

6. SECONDARY TESTS

The motivation for the second series of tests was to determine the perceptual differences apparent between actual multichannel recordings and convolved spot recordings of multiple sources, compensated and equalized for loudspeaker and source microphone responses. In these tests, the recordings of the full 7 singer ensemble were used. Each member of the same group of listeners was again individually tested at the centre of the 7.1 channel array. The tests were performed as paired comparison tests and, as before, were presented to each listener in a randomized fashion so as to reduce order effects. From the previous experimentation it was realized that it was difficult to distinguish the actual from the compensated virtual recordings. Instead, each listener was asked to focus on which of the audio sequences they felt sounded more natural.
Figure 4 shows that in all three cases, the majority of the listeners picked the actual recordings as the more natural, with the OCT-7 array obtaining the highest score. It is interesting to note that the addition of the highly diffuse Hamasaki Square to the OCT-3 array reduces the naturalness distinction between the actual and virtual recordings.

Fig. 4: % Naturalness preference (A = actual, V = virtual, S = same). Panels: OCT 7, Ambisonics, OCT3 + Hamasaki; y-axis: % preference for naturalness.

The listeners were also asked to gauge the imaging accuracy of each of the presentations on a scale up to 5, where the lowest rating denotes poor imaging and 5 denotes excellent localisation. The normalised results are presented in Figure 5. Here, the actual recordings gained preference over the virtual. The OCT-3 and Hamasaki square natural recordings, however, were shown to have less accurate imaging than the OCT-7, albeit with strong deviations from the mean values.

Fig. 5: Imaging rating for audio sequences (A = actual, V = virtual). Panels: OCT 7, Ambisonics, OCT3 + Hamasaki; y-axis: perceived imaging accuracy.

The difference in perceptual envelopment between the virtual and actual recordings was also tested on a scale up to 5, where the lowest rating denotes bad and 5 denotes excellent immersion. The OCT-3-Hamasaki recordings and the Ambisonics recordings show close matches in this regard, with an average envelopment of approximately 3. The actual OCT-7 recording showed a slightly higher rating, which suggests that accurate vertical imaging also contributes to a strong sense of envelopment.

Fig. 6: Perceived envelopment for audio sequences (A = actual, V = virtual; O = µ; standard deviation also marked). Panels: OCT 7, Ambisonics, OCT3 + Hamasaki; y-axis: envelopment.

Finally, since none of the aforementioned attributes is a real indication of overall listener preference, and for the sake of completeness, each listener was also asked which of the audio sequences they preferred. Whilst this parameter is highly subjective and dependent on the recording environment and musical content, it is interesting to note that a strong preference existed for the natural recordings in the OCT-7 and Ambisonics cases. The preference for the OCT-3 and Hamasaki square combination seemed undecided, with % choosing their preference as the same for each. This should in no way be interpreted as a poor recording combination, but rather as an indication that it is harder to choose between the actual and virtual versions using this setup, which may be due to the extremely diffuse lateral field.

Fig. 7: % Listener preference (A = actual, V = virtual, S = same). Panels: OCT 7, Ambisonics, OCT3 + Hamasaki; y-axis: % listener preference.

It can be concluded from these secondary tests that direct source imaging plays an important role in naturalness perception, in particular for imaging of multiple sources. This is supported by the fact that the naturalness and imaging tests gave very similar results, with OCT-7 receiving the majority vote in both cases. The listener preference test also indicates that the actual OCT-7 recordings are preferred due to their naturalness and good source imaging, although this result is subjective and dependent on the musical context and recording situation.

7. CONCLUSIONS

The results shown indicate that whilst the diffuse properties of natural environments can be well modelled with linear time-invariant systems, blind listening tests indicate that the direct-source imaging and early reflections of such systems are not equivalent to those of natural recordings. Furthermore, recordings produced with the common (commercial) methods of direct convolution without compensation are easily distinguished by expert listeners when played against equivalent natural recordings. Of the multichannel techniques

tested, the OCT-7 showed the most promise in terms of imaging and naturalness capabilities for 7. playback. It was also found that the distinction between actual and virtual playback becomes more difficult when a more diffuse reverberant field is introduced. Further work is required to objectively compare virtual and actual recordings in order to improve convolution systems for better virtual imaging. The effects of downmixing on the perceptual attributes of such recordings also require investigation.

8. ACKNOWLEDGEMENTS

The authors gratefully acknowledge the assistance of Helmut Wittek of Schoeps GmbH, Ken Giles of Soundfield Ltd. and Gordon Kapes of Studio Technologies, Inc. Many thanks also to the TCD Chapel Choir, and to John Squires and the students of the Music and Media Technology Course, TCD, for their assistance in documenting and recording this work. Gavin Kearney also gratefully acknowledges the support of Science Foundation Ireland.

9. REFERENCES

[1] A. Farina, Simultaneous measurement of impulse response and distortion with a swept-sine technique.
[2] David Griesinger, Practical processors and programs for digital reverberation, in 7th International Conference of the Audio Engineering Society, March 1989.
[3] Ralph Kessler, An optimised method for capturing multidimensional acoustic fingerprints, in AES 8th Convention, May 5.
[4] A. Farina and L. Tronchin, Advanced techniques for measuring and reproducing spatial sound properties of auditoria, in International Symposium on Room Acoustics: Design and Science, April.
[5] DTS, DTS-HD High Definition audio, 7.
[6] Jeffrey Borish and James B. Angell, An efficient algorithm for measuring the impulse response using pseudorandom noise, Journal of the Audio Engineering Society, vol. 3, no. 7/8, pp. , August 1983.
[7] M. Williams, Unified theory of microphone systems for stereophonic sound recording, in 8nd AES Convention, March 1987.
[8] G. Theile, Multichannel Natural Music Recording Based On Psychoacoustic Principles, October, IRT, Extended version of the paper presented at the AES 9th International Conference, May.
[9] D. G. Malham and A. Myatt, 3-D sound spatialisation using Ambisonic techniques, Computer Music Journal, vol. 9, pp. 58 7, 1995.
[10] M. A. Gerzon, Criteria for evaluating surround-sound systems, Journal of the Audio Engineering Society, vol. 5, pp. 8, 1977.
[11] D. McGriffy, Visual virtual microphone, http://mcgriffy.com/audio/ambisonic/vvmic/.
[12] M. R. Schroeder, New method of measuring reverberation time, The Journal of the Acoustical Society of America, vol. 37, no. 3, pp. 9, 95.
[13] H. Wittek, Image assistant., 7.
[14] The ITU Radiocommunication Assembly, Recommendation ITU-R BS.8-: General methods for the subjective assessment of sound quality.
[15] Toshiyuki Okano, Leo L. Beranek, and Takayuki Hidaka, Relations among interaural cross-correlation coefficient (IACC), lateral fraction (LF), and apparent source width (ASW) in concert halls, The Journal of the Acoustical Society of America, vol. , no. , pp. 55 5, 1998.
[16] Takayuki Hidaka, Leo L. Beranek, and Toshiyuki Okano, Interaural cross-correlation, lateral fraction, and low- and high-frequency sound levels as measures of acoustical quality in concert halls, The Journal of the Acoustical Society of America, vol. 98, no. , pp. , 1995.
[17] R. Mason, T. Brookes, and F. Rumsey, Creation and verification of a controlled experimental stimulus for investigating selected perceived spatial attributes, in AES th Convention, March 3.
