A natural acoustic front-end for Interactive TV in the EU-Project DICIT


L. Marquardt a, P. Svaizer b, E. Mabande a, A. Brutti b, C. Zieger b, M. Omologo b, and W. Kellermann a

a Multimedia Communications and Signal Processing, University of Erlangen-Nuremberg, Cauerstr. 7, 91058 Erlangen, Germany
b Fondazione Bruno Kessler - irst, Via Sommarive 18, 38100 Trento, Italy
E-mail addresses: {marquardt,mabande,wk}@lnt.de (a), {svaizer,brutti,zieger,omologo}@fbk.eu (b)

Abstract

Distant-talking Interfaces for Control of Interactive TV (DICIT) is a European Union-funded project whose main objective is to integrate distant-talking voice interaction as a complementary modality to the use of a remote control in interactive TV systems. Hands-free and seamless control enables natural user-system interaction and provides a suitable means to greatly ease information retrieval. In the given living-room scenario, the system recognizes commands spoken by multiple and possibly moving users, even in the presence of background noise and TV surround audio. This paper focuses on the multichannel acoustic front-end (MCAF) processing for acoustic scene interpretation, which is based on the combination of multichannel acoustic echo cancellation, blind source separation, beamforming, acoustic event classification, and multiple-speaker localization. The fully functional DICIT prototype consists of the MCAF, automatic speech recognition, natural language understanding, mixed-initiative dialogue, and a satellite connection.

1. Introduction

The goal of DICIT [1] is to provide a user-friendly multimodal interface that allows voice-based access to a virtual smart assistant for interacting with TV-related digital devices and infotainment services, such as digital TV, Hi-Fi audio devices, etc., in a typical living room.
Multiple and possibly moving users can use their voice to control the TV, e.g., requesting information about an upcoming program and scheduling its recording, without the need for any hand-held or head-mounted gear. This scenario requires real-time-capable acoustic signal processing techniques which compensate for the impairment of the desired speech signals by acoustic echoes from the loudspeakers, local interferers, ambient noise, and reverberation. Accordingly, one of the key components for the prototypes developed within the DICIT project is the combination of state-of-the-art multichannel acoustic echo cancellation (MC-AEC), beamforming (BF), blind source separation (BSS), smart speech filtering (SSF) based on acoustic event detection and classification, and multiple source localization (SLOC) techniques. (This work was partially supported by the European Commission within the DICIT project under contract number FP6 IST-034624.)

The subsequent sections of this paper are structured as follows: In Sect. 2 we describe the general architecture of the overall DICIT system. The acoustic front-end, as a crucial building block of the DICIT system, is presented in Sect. 3: We first describe the currently fully integrated front-end, which is based on MC-AEC, BF, SLOC, and SSF (see also the video at [1]). An alternative approach under development, featuring BSS, MC-AEC, and SSF, is presented next. Conclusions and an outlook on next steps and further possible improvements are given in Sect. 4.

2. The DICIT System

In the following, we first describe the architecture of the overall DICIT system, outline the functionality of its most important components, and briefly describe the hardware used.

2.1. System architecture

The main building blocks of the DICIT system are the signal acquisition and playback hardware, the acoustic front-end processing, the automatic speech recognition (ASR) and natural language understanding (NLU) unit, and the actual dialogue manager (DM), as depicted in Fig. 1.

Figure 1. DICIT architecture

The first block comprises the hardware for signal acquisition and reproduction, as detailed in the upper part of Fig. 2. Its main components are the 13-channel microphone array and a multichannel loudspeaker system, capturing the acoustic signals from the environment and playing back the digitally mixed outputs of the TV and the dialogue system in stereo format, respectively. Note that the TV system comprises a remote control device as well as a set-top box (STB) platform providing access to on-air satellite signals.

The acoustic front-end processing, which will be described in detail in Sect. 3, extracts the desired speech from the microphone signals and passes it to the subsequent ASR. Given the state of the art in robust speech recognition, it is still crucial for the targeted environment to remove, to the greatest possible extent, any signal impairments due to reverberation, background noise, interferers, and acoustic feedback from the loudspeakers to the microphones, and to forward only those signal segments to the ASR that can reliably be classified as user speech. Continuous speech recognition technology in DICIT is based on IBM Embedded ViaVoice (EVV) [2]. Acoustic models have been trained to optimize the recognition performance for the distant-talking voice characteristics as well as the typically noisy and reverberant conditions of the addressed scenario. The ASR output is interpreted by an NLU unit which employs a statistical model called the Multi-level Action Classifier [2]. This processing chain has been optimized for the English, German, and Italian languages to enable multilinguality as an additional feature of the system. The DM finally manages all interactions with user input and system output and interfaces to external data and devices. Depending on the NLU output or remote control input, the DM is primarily responsible for information retrieval from the electronic program guide (EPG) and for controlling the TV/STB system. Feedback to the user is given acoustically via speech generation and visually via the screen.

2.2. Hardware setup

Apart from the microphone array and the loudspeakers, the hardware setup consists of AD/DA converters, preamplifiers, the STB, and two PCs; the use of two PCs was dictated by the need for two different operating systems. The first, Linux-based PC is equipped with a multichannel digital soundcard and hosts the acoustic front-end processing modules.
The second, Windows-based PC hosts the ASR, the NLU, and the DM; communication between the two PCs is established via a standard TCP-based internet protocol. The video signal from the STB is displayed via an LCD screen or video projector.

3. Acoustic front-end

The acoustic front-end supports different combinations of signal processing components. Its configuration depends primarily on computational constraints and the requirements of the specific scenario. The following subsections describe two practically relevant architectures, both featuring MC-AEC but differing with respect to the employed spatial processing and source localization techniques. While the first configuration is part of the current DICIT prototype and uses beamforming and traditional correlation-based source localization, the BSS-based architecture, which aims at extended functionality and reduces the number of microphones, is currently being integrated.

3.1. BF-/SLOC-based front-end

The DICIT prototype is based on an acoustic front-end which efficiently combines stereo acoustic echo cancellation (SAEC), BF, SLOC, and SSF. The front-end and its connection to the signal acquisition and playback stage are depicted in Fig. 2.

Figure 2. Acoustic front-end (based on BF and SLOC)

We first consider the structure of the entire BF-/SLOC-based front-end before describing its individual components in more detail. While BF extracts the speech signal originating from the desired look direction with minimum distortion and suppresses unwanted noise and interference [3], AEC compensates for the acoustic coupling between loudspeakers and sensors [4]. Since the scenario implies an almost unconstrained and possibly time-varying user position, a correspondingly adaptive BF structure was employed. Its combination with the SAEC structure was guided by the principles laid out in [5]: Since applying SAEC to all 13 microphone signals is computationally too expensive, the SAEC was placed behind the BF structure.
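The computational motivation for this ordering can be illustrated with a back-of-the-envelope operation count. The sketch below assumes time-domain multiply-accumulate costs and an illustrative echo-canceller length; only the microphone count, the stereo reference count, and the 512-tap beamformer length come from the paper.

```python
# Rough multiply-accumulate count per output sample for the two orderings.
# AEC_TAPS is an illustrative assumption, not a value from the paper.
N_MICS = 13        # microphones in the DICIT array
N_REF = 2          # stereo reference channels for the SAEC
AEC_TAPS = 4096    # echo-canceller length per reference channel (assumed)
BF_TAPS = 512      # beamformer FIR length per microphone (Sect. 3.1)

# AEC on every microphone, then beamforming:
aec_first = N_MICS * N_REF * AEC_TAPS + N_MICS * BF_TAPS
# Beamforming first, then a single SAEC on the beamformer output:
bf_first = N_MICS * BF_TAPS + N_REF * AEC_TAPS

print(aec_first, bf_first, round(aec_first / bf_first, 1))
```

Under these assumptions, running the SAEC behind the beamformer is several times cheaper than cancelling echoes on every microphone channel, which is why only the spatial filtering is replicated per beam.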
A set of five data-independent beamformers is computed in parallel; they cover the possible speaker positions, and moving users are tracked by switching between beams. Thereby, the AECs do not need to track time-varying beamformers. Instead of placing one SAEC behind each beamformer output, only one SAEC is computed, for the beam covering the source of interest. Assuming that beam-switches occur infrequently, the necessary readaptation of the SAEC filter coefficients is acceptable. The reuse of AEC filter coefficients determined for previously selected beamformers further reduces the impact of occasional beam-switches. The selection of the beamformer output to be passed to the SAEC is made by the source localization. As the SLOC needs to use microphone signals which still contain acoustic echoes of the TV audio signals, a-priori knowledge of the loudspeaker positions has to be exploited to exclude the TV loudspeakers as sources of interest. Finally, the SSF module analyzes the output of the SAEC in order to detect speech segments from the user. For a robust system it is crucial that only the desired speech segments, and no nonstationary noise or echo residuals, are passed to the ASR; the corresponding decision is supported by the SLOC information. As an example, Fig. 3 shows the effect of the front-end processing for a recording containing five control utterances ("ok", "set volume to seven", "CNN", "set volume to five", and "show me the EPG") spoken at a distance of 2.5 meters from the microphone array in broadside direction, in a room with a reverberation time of 300 msec, a background noise level of 36 dB SPL, and real TV audio output. The upper subplot shows a single microphone input, while the lower plot depicts the AEC output together with the correct segmentation by the SSF unit. The cancellation of the TV loudspeaker echoes is characterized by a mean echo return loss enhancement (ERLE) of 28 dB calculated over the last five seconds. (The delay between the microphone input and the AEC output is 2 msec.)

Figure 3. Acoustic front-end processing: microphone signal (top) and segmented signal after BF, AEC, and SSF (bottom), over time t [s]

The following paragraphs outline the algorithms that have been chosen and adapted for the described scenario.

Beamforming.
To account for the wideband nature of speech and ensure good spatial selectivity, a nested-array-based BF design was chosen [6], using 13 microphones to form four subarrays, one of which uses seven microphones and three of which use five microphones each, with spacings of 0.32 m, 0.16 m, 0.08 m, and 0.04 m, respectively. These subarrays operate in the frequency bands of 100-900 Hz, 900-1800 Hz, 1800-3600 Hz, and 3600-8000 Hz, respectively. In the acoustic front-end, the BF module consists of a filter-and-sum beamformer (FSB) and five steering units (SU). An FSB based on a Dolph-Chebyshev design (FSB-DC) [7] with FIR filters of length 512 taps was selected here for its good spatial selectivity and its robustness to sensor calibration errors. The steering units consist of sets of fractional delay filters [8] which steer the beam to the five predefined look directions. They are inserted after the FSB filtering of the individual channels. Thereby, the FSB filtering of the microphone signals is required only once for all beams, and only the delaying and summation of the microphone channels has to be carried out per beam.

Multi-channel Acoustic Echo Cancellation. The algorithm employed in the current acoustic front-end is based on the generalized frequency-domain adaptive filtering (GFDAF) paradigm [9]. Exploiting the computational efficiency of the FFT to minimize computational load, it also accounts for the cross-correlations among the different reproduction channels to accelerate the convergence of the filters and, consequently, achieves more efficient echo suppression. This is crucial in the given scenario, as user movements have to be expected, which in turn imply rapid changes of the impulse responses of the loudspeaker-enclosure-microphone (LEM) system that has to be identified by the adaptive filters.
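The full GFDAF update is beyond a short example, but the basic single-channel, overlap-save frequency-domain adaptive filter it generalizes can be sketched as follows. This is a hedged illustration only: one reference channel, no cross-channel terms, and illustrative parameter values.

```python
import numpy as np

def fdaf_aec(x, d, L=64, mu=0.5):
    """Single-channel overlap-save frequency-domain adaptive filter:
    a much-simplified sketch of the idea behind GFDAF [9] (one reference
    channel, no cross-channel terms). x: loudspeaker signal, d: microphone
    signal, L: adaptive filter length. Returns the residual-echo signal."""
    W = np.zeros(2 * L, dtype=complex)                 # weights, DFT domain
    err = np.zeros(len(d))
    for b in range(1, len(x) // L):
        X = np.fft.fft(x[(b - 1) * L : (b + 1) * L])   # two input blocks
        y = np.real(np.fft.ifft(X * W))[L:]            # valid (linear) part
        e = d[b * L : (b + 1) * L] - y                 # echo-cancelled block
        E = np.fft.fft(np.r_[np.zeros(L), e])
        G = np.real(np.fft.ifft(np.conj(X) * E / (np.abs(X) ** 2 + 1e-6)))
        G[L:] = 0.0                                    # gradient constraint
        W += mu * np.fft.fft(G)                        # normalized update
        err[b * L : (b + 1) * L] = e
    return err
```

Block processing via the FFT is what keeps the per-sample cost low; GFDAF additionally normalizes the update with the full cross-power spectral density matrix of the reproduction channels, which is what accelerates convergence for correlated stereo references.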
Since the stereo channels of the TV audio are usually very similar and therefore not only highly auto-correlated but also often strongly cross-correlated, a preceding channel decorrelation (see Fig. 2) allows a further acceleration of the filter convergence. While breaking up the inter-channel correlation, the introduced signal manipulations must not cause audible artifacts. For the discussed acoustic front-end, the phase modulation-based approach according to [10] has been implemented, which reconciles the requirements of low complexity and convergence support with the demand of not impairing subjective audio quality, especially the spatial image of the reproduced sound. Due to the combination of a single AEC with the switched beamformer described above, the AEC sees a different acoustic echo path after each beam-switch. To avoid readapting the AEC filters from non-matching coefficients, the filter coefficients that were identified during the previous use of the respective beam are used as a starting point for readaptation [5]. In the given scenario this proves to be very efficient, as underlined by Fig. 4, where the ERLE is compared for adaptation with (right) and without (left) coefficient buffering following a beam-switch of the DICIT beamformer at t=2s, given continuous TV audio output.

Figure 4. Effect of beam-switching without and with coefficient buffering (instantaneous ERLE [dB] over time t [s])

Source Localization. Acoustic maps, computed on a grid of points in an enclosure, express the plausibility of sound being generated at those points and hence represent a valid solution to the SLOC problem. In particular, the global coherence field (GCF) [11], also known as SRP-PHAT [3], combines the information obtained through a generalized cross-correlation phase transform (GCC-PHAT) [12] analysis at different microphone pairs. Given a GCF map, the SLOC problem can be addressed by picking the peaks appearing at the spatial points corresponding to active acoustic sources. In DICIT, the subarray consisting of seven microphones at 0.32 m spacing is used for the GCF computation, as it guarantees good performance at a reasonable computational cost. In order to avoid beam-switching during silence and to reduce the impact of false beam-switches due to faulty localization estimates, the SLOC module provides the BF with a new position estimate only if the map peak is above a given fixed threshold. In fact, the amplitude of the peak is correlated with the relevance of acoustic activity and can therefore act as an embedded acoustic activity detector. If the map peak is below the chosen threshold, the previous position is kept. Besides robustness, promptness is a crucial requirement for the module, so that the system can quickly steer the beam toward the speaker as soon as he/she starts speaking. A memoryless localization is therefore employed in combination with a post-processing step whose goal is to suppress outliers, i.e., isolated estimates located far away from the current speaker area. As the SLOC module operates on microphone signals still containing the TV echoes, estimating the position of the user requires suppression of the loudspeaker signals. In DICIT, the loudspeaker contributions are removed at the GCC-PHAT level by exploiting the knowledge of their positions relative to the microphone array.
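As an illustration, a minimal GCC-PHAT TDOA estimator for one microphone pair can be sketched as follows; a full GCF map would sum such whitened correlations, evaluated at the lags implied by each candidate grid point, over all microphone pairs. Parameter values are illustrative.

```python
import numpy as np

def gcc_phat_tdoa(x1, x2, fs, max_tau):
    """GCC-PHAT [12] for one microphone pair: whiten the cross-spectrum so
    only phase (i.e., delay) information remains, then pick the peak lag.
    Returns the delay of x2 relative to x1 in seconds (positive: x2 lags)."""
    n = 2 * len(x1)
    X1 = np.fft.rfft(x1, n)
    X2 = np.fft.rfft(x2, n)
    cs = X2 * np.conj(X1)
    cc = np.fft.irfft(cs / (np.abs(cs) + 1e-12), n)   # phase transform
    max_shift = int(max_tau * fs)
    cc = np.r_[cc[-max_shift:], cc[: max_shift + 1]]  # lags -max..+max
    return (np.argmax(cc) - max_shift) / fs
```

Restricting the search to physically plausible lags (`max_tau`, set by the microphone spacing and the speed of sound) is also what makes the peak-amplitude threshold described above meaningful as an activity detector.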
The approach is derived from the multiple-source localization approach in [13], treating the single user plus the TV loudspeakers as multiple simultaneously active sources. Fig. 5 shows an example of a GCF map before (left) and after (right) the removal of the loudspeaker contributions (bright colors represent high values; the stereo loudspeakers and the DICIT array are schematically depicted on the right side of each plot). Only after the de-emphasis of the loudspeakers does the user position (indicated by the circle) correspond to the highest-activity region, as visible in the right plot. Experiments conducted on Wizard-of-Oz data collected in reverberant rooms [14] show that the SLOC module estimates the source position with an RMS error of 7.5 degrees.

Figure 5. GCF map before and after the removal of the loudspeaker contributions

Smart Speech Filtering. After the signal processing by MC-AEC, the sound produced by the TV has been almost completely cancelled from the beamformer output; user commands can therefore be detected on the basis of the dynamics of the resulting signal. Constraints are applied concerning the minimum duration of utterances and the maximum duration of pauses between words in order to isolate potentially relevant signal segments. Additionally, only signal segments exhibiting sufficient spatial coherence at the microphones, i.e., plausibly produced by a speaker in an area in front of and oriented towards the TV, are retained. Thus, speakers in other areas or not addressing DICIT can be ignored. SLOC information is exploited at this stage in order to take into account both the speaker's position and likely orientation [15].

3.2. BSS-based front-end

The BSS-based front-end described in the following represents an alternative to the front-end presented in the previous section and is currently being integrated into the overall system. Fig. 6 shows the corresponding block diagram.

Figure 6.
Acoustic front-end (based on BSS)

Since BSS can be interpreted as a set of adaptive null-beamformers, it replaces the functionality of the data-independent beamformers and the source localization of the first approach. One major advantage of the BSS-based front-end is the reduced number of microphones: the envisaged BSS-based front-end needs only two sensors, which is of great importance with respect to overall system complexity, user acceptance, and cost. A second benefit is that, in contrast to the prototype based on the front-end described in Sect. 3.1, which can currently extract only one active user, BSS using two sensor signals is also able to extract two simultaneously speaking users. In any case, two streams of data will be delivered to the following SSF module, carrying the following signals:

- If no user is active, two zero-valued signals arrive at the SSF component.
- If one user is active, his or her signal will appear in one SSF input and will be attenuated in the other SSF input.
- If two users are simultaneously active, each SSF input will be dominated by one user signal.

BSS can be combined with AEC in two different ways: the AEC can be performed directly on the microphone inputs, or it can be applied at a later stage, to the BSS outputs. Taking into account the considerations described in [5, 16], we concentrated on the AEC-first alternative, as shown in Fig. 6. The SLOC module depicted in Fig. 6 represents an additional source of information; it might supplement the BSS-inherent source localization and thus also help to improve the decisions to be made by the SSF. The SSF first processes the two input streams provided by BSS in order to detect speech segments and reject any non-speech event by means of an acoustic event classifier. Moreover, because the SSF here has to work on more than one input stream, two simultaneously active speakers will likely create two streams with valid speech segments. Therefore it must be decided which speech signal to pass to the ASR and which one to reject. This decision can be based on speaker identification.
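A toy version of this two-stream decision could look as follows. The energy threshold and the dominance rule are placeholder assumptions standing in for the actual acoustic event classification and speaker identification described below.

```python
import numpy as np

def select_stream(s1, s2, p_thresh=1e-3):
    """Decide which (if any) BSS output stream to forward to the ASR.
    Placeholder logic: short-term power stands in for the SSF's acoustic
    event classifier, and dominance for its speaker-identification step."""
    p1, p2 = float(np.mean(s1 ** 2)), float(np.mean(s2 ** 2))
    if p1 < p_thresh and p2 < p_thresh:
        return None                    # no user active: forward nothing
    if p1 >= p_thresh and p2 >= p_thresh:
        return 0 if p1 >= p2 else 1    # two users: keep the dominant stream
    return 0 if p1 >= p_thresh else 1  # one user: keep the active stream
```

The three branches mirror the three cases listed above (no user, one user, two simultaneous users); in the real SSF, the final branch would be resolved by speaker identification rather than raw signal power.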
The algorithms for MC-AEC and the related signal decorrelation are the same as in the preceding Sect. 3.1. The following paragraphs introduce the components which are only used within the BSS-based architecture.

Blind Source Separation. The extraction of up to two simultaneously active sources with two microphones corresponds to the overdetermined or the determined BSS case, respectively. Approaches based on independent component analysis (ICA) are well suited for both cases, requiring merely the assumption of statistical independence of the original source signals. Here, we consider a broadband BSS approach based on the TRINICON framework [17]. For the development of the BSS-based front-end, we implemented an efficient second-order-statistics (SOS) version of the TRINICON update rule [18]. While BSS recovers the original source signals from a (possibly reverberant) sound mixture without a-priori knowledge about the locations of the sources, the BSS demixing filters also contain information on the source locations. One way to retrieve this localization information has been presented in [19]: it relies on the ability of a broadband BSS algorithm to perform blind adaptive identification of the acoustic environment for two microphone channels. Thus, two time-differences-of-arrival (TDOAs) can be extracted by identifying the highest peaks in the BSS filters, which correspond to the direct paths.

Acoustic Event Classification and Speaker Identification for SSF. In the foreseen scenario, a classification step may be necessary to discriminate actual speech segments from other interfering events (phone ringing, sneezing, laughing, etc.). The foreseen acoustic event classification (AECL) is based on a set of mel-frequency cepstral coefficients (MFCCs) as acoustic signal features. A score is computed by comparing the observed feature vectors with Gaussian mixture models (GMMs), trained on examples of the considered acoustic events.
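The GMM scoring just described can be sketched with a diagonal-covariance mixture evaluated on a matrix of MFCC frames. The model parameters below are toy values; a real system would train them on labelled event data (e.g., via EM).

```python
import numpy as np

def avg_loglik(X, weights, means, variances):
    """Average per-frame log-likelihood of feature frames X (T x D) under a
    diagonal-covariance GMM with the given component parameters."""
    comp = []
    for w, m, v in zip(weights, means, variances):
        comp.append(np.log(w)
                    - 0.5 * np.sum(np.log(2.0 * np.pi * v))
                    - 0.5 * np.sum((X - m) ** 2 / v, axis=1))
    ll = np.stack(comp, axis=1)          # T x K per-component log terms
    mx = ll.max(axis=1)                  # log-sum-exp over the K components
    return float(np.mean(mx + np.log(np.exp(ll - mx[:, None]).sum(axis=1))))

def classify_event(X, models):
    """models: {event label: (weights, means, variances)} -> best label."""
    return max(models, key=lambda label: avg_loglik(X, *models[label]))
```

Averaging the log-likelihood over frames makes the score independent of segment length, so events of different durations can be compared on the same scale.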
The best match in terms of average likelihood provides the classification of the signal segment [20]. Moreover, when the classified event is speech, it may be necessary to determine the speaker identity as well. In this case a speaker identification (SID) capability must be introduced in the SSF, consisting of the two steps of feature extraction and score computation. The acoustic features are again MFCCs, while the scoring is accomplished by combining the results of two sub-systems implementing a GMM and a support vector machine (SVM), respectively [21]. In the GMM-based sub-system, speaker-dependent models are obtained through maximum a posteriori (MAP) adaptation of the mean vectors, starting from a universal background model (UBM) that represents the background speaker population. In the SVM-based sub-system, elements belonging to non-linearly separable classes are discriminated by a binary classifier operating with non-linear kernel functions. When more than one speaker is active, the performance of SID is strongly related to the amount of residual interfering speech that may be present at the BSS output. The effect of BSS on SID performance will be investigated further.

4. Conclusions and outlook

In this paper we presented the multichannel acoustic front-end of an already fully functional prototype for interactive TV, which has been developed within the EU-funded project DICIT. We also introduced an alternative architecture based on BSS, which extends the functionality of the BF-/SLOC-based approach to multi-user scenarios. The BF-/SLOC-based front-end supports one user whose movements can be tracked fast enough by the SLOC module, so that the combination of BF and AEC guarantees good signal quality for the desired speech. The SSF can thus pass the user commands to the subsequent ASR while rejecting undesired residual disturbances. The front-end architecture accounts for computational constraints with an efficient combination of a switched beamformer and AEC. As illustrated by the experimental results, AEC filter coefficient buffering has proven a simple but effective strategy to improve AEC performance in the case of beam-switches. During the following months, the prototype's performance will be evaluated by 18 test subjects. As a next step, and to overcome the limitation to single-user scenarios, BSS in conjunction with AEC will be used for both extraction and localization of multiple users. This also allows a drastic reduction of the number of necessary microphones. For the BSS-based approach, an adequate SSF module including speaker identification capabilities will be developed before a first comparative evaluation of both acoustic front-ends. In general, the short utterances that characterize the dialogue pose a persistent challenge: to further optimize the convergence speed of the adaptive filtering algorithms involved in AEC and BSS, and to find decision criteria for beam-switching, localization, and SSF which provide maximum reliability with a minimum amount of observation data.

References

[1] The DICIT project, http://dicit.itc.it
[2] J. Huang, M. Epstein, and M. Matassoni. Effective acoustic adaptation for a distant-talking interactive TV system. Proc. Interspeech 2008, Brisbane, Australia, September 2008.
[3] M. S. Brandstein and D. B. Ward, Eds. Microphone Arrays: Signal Processing Techniques and Applications. Springer, Berlin, 2001.
[4] E. Haensler and G. Schmidt. Acoustic Echo and Noise Control: A Practical Approach. Wiley, New York, 2004.
[5] W. Kellermann. Strategies for combining acoustic echo cancellation and adaptive beamforming microphone arrays. Proc. ICASSP 1997, Munich, Germany, April 1997.
[6] J. L. Flanagan, J. D. Johnson, R. Zahn, and G. W. Elko. Computer-steered microphone arrays for sound transduction in large rooms. JASA, 78(5), November 1985.
[7] W. Herbordt. Sound Capture for Human/Machine Interfaces. Springer, Berlin, 2005.
[8] T. I. Laakso, V. Välimäki, M. Karjalainen, and U. K. Laine. Splitting the unit delay - tools for fractional delay filter design. IEEE Signal Processing Magazine, 13(1):30-60, January 1996.
[9] H. Buchner, J. Benesty, and W. Kellermann. Generalized multichannel frequency-domain adaptive filtering: efficient realization and application to hands-free speech communication. Signal Processing, 85(3):549-570, March 2005.
[10] J. Herre, H. Buchner, and W. Kellermann. Acoustic echo cancellation for surround sound using perceptually motivated convergence enhancement. Proc. ICASSP 2007, Honolulu, Hawaii, April 2007.
[11] R. De Mori. Spoken Dialogue with Computers. Academic Press, London, 1998.
[12] C. Knapp and G. Carter. The generalized correlation method for estimation of time delay. IEEE Trans. ASSP, 24(4), 1976.
[13] A. Brutti, M. Omologo, and P. Svaizer. Localization of multiple speakers based on a two-step acoustic map analysis. Proc. ICASSP 2008, Las Vegas, USA, April 2008.
[14] A. Brutti, L. Cristoforetti, W. Kellermann, L. Marquardt, and M. Omologo. WOZ acoustic data collection for interactive TV. Proc. LREC 2008, Marrakech, Morocco, May 2008.
[15] A. Brutti, M. Omologo, and P. Svaizer. Speaker localization based on oriented global coherence field. Proc. Interspeech 2006, Pittsburgh, USA, September 2006.
[16] A. Lombard, K. Reindl, and W. Kellermann. Combination of adaptive feedback cancellation and binaural adaptive filtering in hearing aids. Accepted in EURASIP Journal on Advances in Signal Processing, 2009.
[17] H. Buchner, R. Aichner, and W. Kellermann. TRINICON: A versatile framework for multichannel blind signal processing. Proc. ICASSP 2004, Montreal, Canada, May 2004.
[18] H. Buchner, R. Aichner, and W. Kellermann. A generalization of blind source separation algorithms for convolutive mixtures based on second-order statistics. IEEE Trans. Speech and Audio Processing, 13(1):120-134, January 2005.
[19] H. Buchner, R. Aichner, J. Stenglein, H. Teutsch, and W. Kellermann. Simultaneous localization of multiple sound sources using blind adaptive MIMO filtering. Proc. ICASSP 2005, Philadelphia, USA, March 2005.
[20] C. Zieger and M. Omologo. Acoustic event classification using a distributed microphone network with a GMM/SVM combined algorithm. Proc. Interspeech 2008, Brisbane, Australia, September 2008.
[21] C. Zieger and M. Omologo. Combination of clean and contaminated GMM/SVM for far-field text-independent speaker verification. Proc. Interspeech 2008, Brisbane, Australia, September 2008.

WOZ Acoustic Data Collection For Interactive TV

WOZ Acoustic Data Collection For Interactive TV WOZ Acoustic Data Collection For Interactive TV A. Brutti*, L. Cristoforetti*, W. Kellermann+, L. Marquardt+, M. Omologo* * Fondazione Bruno Kessler (FBK) - irst Via Sommarive 18, 38050 Povo (TN), ITALY

More information

FP6 IST

FP6 IST FP6 IST-034624 http://dicit.itc.it Deliverable 2.4 Hardware and Software Architecture for the Final STB Prototype Lead Authors Rajesh Balchandran Martin Labsky Affiliation IBM Research Date: August 20,

More information

DESIGNING OPTIMIZED MICROPHONE BEAMFORMERS

DESIGNING OPTIMIZED MICROPHONE BEAMFORMERS 3235 Kifer Rd. Suite 100 Santa Clara, CA 95051 www.dspconcepts.com DESIGNING OPTIMIZED MICROPHONE BEAMFORMERS Our previous paper, Fundamentals of Voice UI, explained the algorithms and processes required

More information

TEPZZ A_T EP A1 (19) (11) EP A1 (12) EUROPEAN PATENT APPLICATION. (51) Int Cl.: H04S 7/00 ( ) H04R 25/00 (2006.

TEPZZ A_T EP A1 (19) (11) EP A1 (12) EUROPEAN PATENT APPLICATION. (51) Int Cl.: H04S 7/00 ( ) H04R 25/00 (2006. (19) TEPZZ 94 98 A_T (11) EP 2 942 982 A1 (12) EUROPEAN PATENT APPLICATION (43) Date of publication: 11.11. Bulletin /46 (1) Int Cl.: H04S 7/00 (06.01) H04R /00 (06.01) (21) Application number: 141838.7

More information

TEPZZ 94 98_A_T EP A1 (19) (11) EP A1 (12) EUROPEAN PATENT APPLICATION. (43) Date of publication: Bulletin 2015/46

TEPZZ 94 98_A_T EP A1 (19) (11) EP A1 (12) EUROPEAN PATENT APPLICATION. (43) Date of publication: Bulletin 2015/46 (19) TEPZZ 94 98_A_T (11) EP 2 942 981 A1 (12) EUROPEAN PATENT APPLICATION (43) Date of publication: 11.11.1 Bulletin 1/46 (1) Int Cl.: H04S 7/00 (06.01) H04R /00 (06.01) (21) Application number: 1418384.0

More information

Automatic Laughter Detection

Automatic Laughter Detection Automatic Laughter Detection Mary Knox Final Project (EECS 94) knoxm@eecs.berkeley.edu December 1, 006 1 Introduction Laughter is a powerful cue in communication. It communicates to listeners the emotional

More information

FP6 IST

FP6 IST FP6 IST-034624 http://dicit.itc.it Deliverable 2.1 DICIT Architecture Tools, Standards, Hardware and Software for the First Prototypes Authors: Gregg Daggett Affiliations: IBM Date: 5-Oct-2007 Document

More information
