Proceedings of Meetings on Acoustics

Volume 19, 2013
http://acousticalsociety.org/

ICA 2013 Montreal
Montreal, Canada
2-7 June 2013

Psychological and Physiological Acoustics
Session 5aPP: Recent Trends in Psychoacoustics I

5aPP2. Loudness of complex time-varying sounds? A challenge for current loudness models

Jan Rennies*, Jesko L. Verhey, Jens E. Appell and Birger Kollmeier
*Corresponding author's address: Hearing, Speech and Audio Technology, Fraunhofer IDMT, Marie-Curie-Str. 2, Oldenburg, 26129, Niedersachsen, Germany, jan.rennies@idmt.fraunhofer.de

The calculation of perceived loudness is an important factor in many applications such as the assessment of noise emissions. Generally, loudness of stationary sounds can be accurately predicted by existing models. For sounds with time-varying characteristics, however, there are still discrepancies between experimental data and model predictions, even with the most recent loudness models. This contribution presents a series of experiments in which loudness was measured in normal-hearing subjects with different types of realistic signals using an adaptive loudness matching procedure and categorical loudness scaling. The results of both methods indicate that loudness of speech-like signals is largely determined by the long-term spectrum, while other speech-related properties (particularly temporal modulations) play only a minor role. Loudness of speech appears to be quite robust towards even severe signal modifications, as long as the long-term spectrum is similar. In contrast, loudness of technical, strongly impulsive signals is considerably influenced by temporal modulations. For some of the signals, loudness could not be predicted by current models. Since the perceived loudness was underestimated by the models for some signals, but overestimated for other signals, a simple adjustment of the employed time constants in the temporal integration stage could not eliminate the discrepancies.
Published by the Acoustical Society of America through the American Institute of Physics 2013 Acoustical Society of America [DOI: 10.1121/1.4799514] Received 18 Jan 2013; published 2 Jun 2013 Proceedings of Meetings on Acoustics, Vol. 19, 050189 (2013) Page 1

INTRODUCTION
Many studies have investigated the loudness of stationary sounds, and different loudness models were derived from these data (e.g., Zwicker et al., 1957; Moore and Glasberg, 1997; DIN, 1991; ANSI, 2007). In general, such models are based on a separation of the input sound into auditory filters, followed by a compression in each filter and a subsequent summation across channels. Such models predict the dependence of loudness on sound pressure level or the effects of spectral loudness summation for complex tones or bandpass noise (i.e., the increase in loudness with increasing bandwidth at the same sound pressure level). Most realistic sounds such as speech or various types of environmental or technical sounds, however, exhibit temporal fluctuations in amplitude, frequency content, or both. Different ways were proposed to extend the stationary loudness models so that they are also applicable to time-varying sounds (e.g., Chalupper and Fastl, 2002; Glasberg and Moore, 2002; DIN, 2010). Generally speaking, these models calculate loudness for short time windows shifted along the signal, followed by temporal smoothing. This temporal integration accounts for the non-instantaneous build-up (e.g., Poulsen, 1981) and decay (e.g., Port, 1963) of loudness over time.

The present study describes a series of experiments that was conducted to test existing loudness models with two sets of real sounds. The first set comprised different types of speech-like signals which varied in their similarity to real speech (Rennies et al., 2013). The second set consisted of different technical sounds including stationary sounds as well as sounds with strong impulsive components. Subjective loudness was measured by determining level differences at equal loudness using both categorical loudness scaling and an adaptive matching procedure.
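The stationary model structure outlined above (separation into channels, compression in each channel, summation across channels) can be illustrated with a deliberately simplified sketch. The function names and the compressive exponent of 0.3 are assumptions made for illustration only; none of the cited models is reproduced here. The sketch merely shows why spreading a fixed total intensity over more channels increases the summed loudness, i.e., spectral loudness summation.

```python
def total_loudness(band_intensities, exponent=0.3):
    """Toy loudness in arbitrary units: compress each channel, then sum.

    The exponent is an assumed illustrative value, not a parameter
    taken from any of the cited loudness models.
    """
    return sum(intensity ** exponent for intensity in band_intensities)


def spread_evenly(total_intensity, n_bands):
    """Distribute a fixed total intensity equally over n_bands channels."""
    return [total_intensity / n_bands] * n_bands


# Same total intensity (same overall level), different bandwidths.
narrow = total_loudness(spread_evenly(1.0, 1))
wide = total_loudness(spread_evenly(1.0, 8))
print(f"1 channel: {narrow:.2f}, 8 channels: {wide:.2f}")
```

Because the compression is applied before the summation, the eight-channel sound yields a larger total than the single-channel sound at the same overall intensity.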
The data were then compared to model predictions to investigate whether the observed effects could be accounted for by current loudness models and whether one of the model approaches could predict loudness more accurately than the others. Apart from expanding the knowledge about factors influencing the loudness of complex sounds, the goal of this contribution was thus to test existing models with respect to their suitability for practical applications such as noise assessment or loudness monitoring.

METHODS
Procedures
Loudness matching
Level differences at equal loudness were measured using an adaptive two-alternative forced-choice procedure. Subjects heard two sounds (a reference and a test signal) separated by 500 ms of silence, and their task was to indicate which of the sounds was perceived as louder. The order of test and reference signal was randomized for each presentation. The level of the reference signal was fixed at 70 dB SPL. The level of the test signal was set adaptively depending on the subject's response (for details, see Rennies et al., 2013). This procedure resulted in an estimate of the level at which the test signal was perceived as equally loud as the reference signal (presented at the reference level). To avoid bias effects, each test signal was matched three times using starting levels of -10, 0, or +10 dB relative to the level corresponding to medium loudness as determined by the categorical loudness scaling for the individual sounds. The 42 runs were divided into seven sessions of six runs, which were measured in an interleaved way to further reduce bias effects (Verhey, 1999). No test signal occurred more than once in a single session. Stimuli were generated in Matlab with a sampling rate of 44.1 kHz and presented diotically via Sennheiser HD650 headphones. All measurements were conducted in sound-attenuated booths.
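As a rough illustration of how an adaptive matching track converges on the level of equal loudness, the following sketch simulates a generic 1-up/1-down staircase. It is not the tracking rule of the study (which is described in Rennies et al., 2013). The step sizes, the number of reversals, the noise-free simulated listener and the starting levels are all assumptions.

```python
def run_matching_track(test_is_louder, start_level_db, step_db=8.0,
                       min_step_db=1.0, n_reversals=8):
    """Generic 1-up/1-down track (assumed rule, for illustration only).

    test_is_louder(level_db) returns True if the listener judges the test
    signal at level_db to be louder than the fixed reference.
    """
    level = start_level_db
    last_direction = 0
    reversal_levels = []
    while len(reversal_levels) < n_reversals:
        # Lower the test level after "test louder", raise it otherwise.
        direction = -1 if test_is_louder(level) else 1
        if last_direction and direction != last_direction:
            reversal_levels.append(level)
            step_db = max(min_step_db, step_db / 2.0)  # halve step at reversals
        level += direction * step_db
        last_direction = direction
    tail = reversal_levels[-4:]
    return sum(tail) / len(tail)  # mean of the last four reversals


REFERENCE_LEVEL_DB = 70.0
TRUE_EQUAL_LOUDNESS_LEVEL_DB = 73.0  # hypothetical listener, not measured data


def simulated_listener(level_db):
    return level_db > TRUE_EQUAL_LOUDNESS_LEVEL_DB


# Three tracks with different starting levels, as a guard against bias.
estimates = [run_matching_track(simulated_listener, start)
             for start in (63.0, 73.0, 83.0)]
level_differences = [e - REFERENCE_LEVEL_DB for e in estimates]
print(level_differences)
```

With a real listener the responses are probabilistic, so individual tracks scatter around the point of subjective equality; averaging tracks with different starting levels reduces the influence of the starting point.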

Categorical loudness scaling
Categorical loudness ratings were measured using the procedure proposed by Brand and Hohmann (2002), i.e., subjects heard a stimulus presented at different levels and indicated the categorical loudness on an 11-point scale comprising categories between "inaudible" and "extremely loud", which were assigned categorical units (cu) from 0 cu (inaudible) to 50 cu (extremely loud) in steps of 5 cu. The presentation levels and their order were calculated using the algorithm described by Brand and Hohmann (2002). The order of the stimuli was randomized, but the loudness scaling of each stimulus was always finished before the scaling of the next stimulus (no interleaved stimulus presentations). For each measurement, this procedure resulted in a categorical loudness function of cu values vs. sound pressure level. These functions were then used to derive level differences at equal loudness by computing the horizontal distance between the loudness function of the reference signal and all other loudness functions at the cu value corresponding to a level of 70 dB SPL for the reference signal (i.e., the reference level; for details, see Rennies et al., 2013). The same apparatus as for the loudness matching experiments was used.

Stimuli
Speech-like sounds
This set of stimuli contained signals differing in their speech-related properties. Signals ranged from a stationary noise which was speech-like only in its long-term spectrum (in the following named speech-shaped noise, SSN stat.) to a real intelligible sentence taken from the Oldenburg sentence test (Wagener et al., 1999). In addition, stimuli with different types of temporal modulations were used: a 4-Hz sinusoidally amplitude-modulated speech-shaped noise (SSN 4Hz), a portion of ICRA5-250 noise (Dreschler et al., 2001), a babble of multiple talkers, as well as a portion of the international female fluctuating masker (IFFM; Holube, 2012).
Except for the German sentence, the stimuli were unintelligible to the subjects. This was also true for the German sentence played backwards (Sentence Rev.) and a Turkish sentence taken from a Turkish speech test (Zokoll et al., 2012). All stimuli were at least 2 s long (signals with speech-like envelopes were cut at the next possible speech pause) and were frozen, i.e., the same stimulus samples were used throughout the experiments. Apart from intelligibility and the type of temporal modulations, the stimuli also differed in their ratio between peak and root-mean-square (rms) value and in whether or not they contained components with a fundamental frequency or temporal gaps (see Rennies et al., 2013). The third-octave-band spectra of the stimuli were equalized to match the male long-term average speech spectrum (LTASS) of Byrne et al. (1994) in order to eliminate the influence of the long-term spectrum on loudness perception. The reference signal for this set of stimuli was always SSN stat.

Technical sounds
This set of stimuli comprised nine signals whose waveforms are shown in Figure 1. Eight of the signals were technical signals ranging from rather stationary (e.g., jet linear) to highly impulsive signals (e.g., machine gun). The signals jet linear and jet non-linear were taken from Gee et al. (2007). The signal LNN was a low-noise noise (Kohlrausch et al., 1997) with a bandwidth of 50 Hz and a center frequency of 1 kHz. This non-technical signal was chosen in order to include a signal with a narrow spectrum and a strong tonal component. The durations of the signals varied between 1.6 and 2.6 s. The reference signal for this set of stimuli was always jet linear.
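Two of the stimulus properties mentioned above, sinusoidal amplitude modulation at 4 Hz and the ratio between peak and rms value, can be made concrete with a short sketch. White Gaussian noise is used here as a stand-in carrier; it is not speech-shaped, and no LTASS equalization is performed. All function names and the choice of full modulation depth are assumptions for illustration.

```python
import math
import random


def rms(x):
    """Root-mean-square value of a sequence of samples."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def crest_factor_db(x):
    """Ratio between peak and rms value, expressed in dB."""
    return 20.0 * math.log10(max(abs(v) for v in x) / rms(x))


def sam(carrier, fs, fm=4.0, depth=1.0):
    """Apply sinusoidal amplitude modulation with rate fm (Hz)."""
    return [(1.0 + depth * math.sin(2.0 * math.pi * fm * n / fs)) * v
            for n, v in enumerate(carrier)]


def scale_to_rms(x, target_rms):
    """Scale a signal so that its rms value equals target_rms."""
    gain = target_rms / rms(x)
    return [gain * v for v in x]


fs = 44100                       # sampling rate used in the study
rng = random.Random(0)           # fixed seed: a "frozen" noise sample
noise = [rng.gauss(0.0, 1.0) for _ in range(2 * fs)]   # 2 s of white noise
noise_4hz = scale_to_rms(sam(noise, fs), rms(noise))   # same rms level

print(f"stationary noise: {crest_factor_db(noise):.1f} dB")
print(f"4-Hz SAM noise:   {crest_factor_db(noise_4hz):.1f} dB")
```

Scaling both signals to the same rms value mirrors the idea of comparing stimuli at equal level while their temporal structure, and hence their peak-to-rms ratio, differs.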

FIGURE 1: Waveforms of the technical sounds.

Subjects
Ten normal-hearing subjects participated in the experiments using speech-like stimuli; eleven (partly different) normal-hearing subjects participated in the measurements using technical sounds. None of the subjects reported any hearing difficulties and all had pure-tone thresholds not exceeding 15 dB HL at audiometric frequencies between 125 and 8000 Hz, except one subject who had a pure-tone threshold of 25 dB HL at 8 kHz in one ear. All subjects participated in both the loudness scaling and the loudness matching experiment.

RESULTS
Speech-like sounds
Figure 2 shows the mean level differences between the test signals and the reference signal (always SSN stat.) at equal loudness, including interindividual standard errors, for the loudness scaling data (triangles) and the loudness matching data (circles). For four of the seven test signals, level differences were virtually identical for the two methods and very close to 0 dB, indicating equal loudness at the same level. For three signals, level differences measured using the matching procedure were slightly positive, indicating that the respective test signals required a slightly higher level to be perceived as equally loud. In contrast, level differences measured in the loudness scaling experiments were between -5 and -2 dB. After a two-way ANOVA had shown that both factors (test signal and measurement method) as well as their interaction were significant at the 5% level, separate one-way ANOVAs (factor test signal) were conducted for each measurement method to investigate the influence of the different speech-related properties on loudness. Surprisingly, level differences differed significantly for the data obtained by loudness matching, but not for the data obtained by loudness scaling, although the mean differences were larger for the latter data set, as described above. This was due to the fact that interindividual standard errors were considerably larger for the scaling procedure (up to 2 dB compared to <1 dB for the matching data).

FIGURE 2: Measured level differences at equal loudness for speech-like stimuli.

Technical sounds
Figure 3 shows the mean level differences between the test signals and the reference signal (always jet linear) at equal loudness for both measurement methods using the same symbols as above. Error bars indicate standard errors across subjects. For some of the signals, level differences differed considerably between the two methods (e.g., ratch wheel and plow). The magnitudes of the level differences obtained by loudness scaling were always smaller than those obtained by loudness matching. As for speech-like signals, standard errors were larger for the scaling than for the matching data. A two-way ANOVA showed that the factors test signal and measurement method as well as their interaction were significant at the 5% level. Pair-wise post-hoc comparisons for each signal revealed that differences between matching and scaling were significant at a Bonferroni-corrected level for the signals ratch wheel, plow, and jet non-linear.

MODEL PREDICTIONS
Adaptive loudness matching is designed to provide accurate estimates of level differences at equal loudness at a single reference level. In contrast, the loudness scaling procedure is designed to describe the entire loudness function, i.e., the entire dynamic range of the subject (although only a single point of each loudness function was used in this study). Level differences derived from loudness scaling may therefore be regarded as less reliable, one indicator of this being the larger standard errors described above. Thus, all model predictions presented in the following were compared to the data of the loudness matching experiment in this study.
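How a single point of each categorical loudness function yields a level difference at equal loudness (the horizontal distance described under Categorical loudness scaling) can be sketched as follows. The loudness functions are assumed here to be straight lines defined by a slope in cu/dB and the level at 25 cu. This is a simplification for illustration and not the fitting function used in the study, and the parameter values are hypothetical.

```python
def cu_from_level(level_db, slope_cu_per_db, level_at_25cu_db):
    """Categorical loudness (cu) at a given level for a linear function."""
    return 25.0 + slope_cu_per_db * (level_db - level_at_25cu_db)


def level_from_cu(cu, slope_cu_per_db, level_at_25cu_db):
    """Inverse of cu_from_level: level (dB SPL) that produces a given cu."""
    return level_at_25cu_db + (cu - 25.0) / slope_cu_per_db


def level_difference_at_equal_loudness(ref_params, test_params,
                                       ref_level_db=70.0):
    """Horizontal distance between two loudness functions.

    Evaluated at the cu value that the reference reaches at ref_level_db.
    Positive values mean the test signal needs a higher level than the
    reference to be rated equally loud.
    """
    target_cu = cu_from_level(ref_level_db, *ref_params)
    return level_from_cu(target_cu, *test_params) - ref_level_db


# Hypothetical (slope, level at 25 cu) pairs, not fitted to any data.
reference = (0.5, 75.0)
test = (0.5, 72.0)
difference = level_difference_at_equal_loudness(reference, test)
print(f"level difference at equal loudness: {difference:+.1f} dB")
```

In this example the test function lies 3 dB to the left of the reference function, so the test signal reaches the same categorical loudness at a level 3 dB lower.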
For all stimuli of the loudness matching experiment, the data were compared to predictions of the dynamic loudness model (DLM) of Chalupper and Fastl (2002), its extension of Rennies et al. (2009), and the model for time-varying loudness (TVL) of Glasberg and Moore (2002). Where appropriate, loudness was also calculated using stationary models.

FIGURE 3: Measured level differences at equal loudness for technical stimuli.

Speech-like sounds
Experimental loudness matching data (black) are compared to model predictions (gray) in Figure 4. In addition to the time-varying models mentioned above, level differences were also calculated using an algorithm for stationary loudness (DIN, 1991). This algorithm was included to test whether the small differences remaining in the long-term spectra after the spectral equalization of the signals (see above) had affected the measured level differences. Crosses in the left panel of Figure 4 indicate that predicted level differences were indeed very close to 0 dB, confirming that spectral effects were negligible. The predictions of the DLM and the extended DLM are shown as left-pointing and right-pointing triangles, respectively. For all pairs of test and reference signals, the predictions of both versions of the DLM were very similar. The comparison to the experimental data showed that only the level difference for babble noise (0 dB) was predicted correctly. For all other signals, predicted level differences were negative, ranging from about -1.9 dB (IFFM) to -4.4 dB (ICRA5-250), while the measured level differences were close to 0 dB or positive. The corresponding predictions of the TVL are shown in the right panel of Figure 4. Squares and diamonds represent predicted level differences based on short-term and long-term loudness, respectively (see Glasberg and Moore, 2002; Rennies et al., 2013). Predictions based on short-term loudness were very similar to the predictions of the DLM and its extension, i.e., the same discrepancies between data and predictions were observed. Predictions based on long-term loudness were generally in better agreement with the matching data. In particular, the magnitudes of the predicted level differences were smaller. In agreement with the matching data for the different sentences, predicted level differences were positive.
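The distinction between short-term and long-term loudness can be illustrated with a generic attack/release smoothing stage applied twice, the second time with slower coefficients. The sketch below is not the TVL implementation. The coefficient values, the 1-ms time step and the on/off "instantaneous loudness" input are assumptions chosen only to show that the slower stage follows speech-like 4-Hz fluctuations far less closely.

```python
def smooth(values, attack, release):
    """One-pole smoothing with separate attack and release coefficients."""
    out = []
    state = 0.0
    for v in values:
        # Follow rises with the attack coefficient, decays with the release one.
        coeff = attack if v > state else release
        state = coeff * v + (1.0 - coeff) * state
        out.append(state)
    return out


def ripple(trace, start=1000):
    """Peak-to-peak fluctuation of a trace after the initial build-up."""
    tail = trace[start:]
    return max(tail) - min(tail)


# Hypothetical instantaneous loudness in 1-ms steps: 4-Hz on/off pattern, 2 s.
instantaneous = [1.0 if (t % 250) < 125 else 0.0 for t in range(2000)]

# Illustrative coefficients (assumed, not taken from the source).
short_term = smooth(instantaneous, attack=0.045, release=0.02)
long_term = smooth(short_term, attack=0.01, release=0.0005)

print(f"short-term ripple: {ripple(short_term):.2f}")
print(f"long-term ripple:  {ripple(long_term):.2f}")
```

A prediction derived from the slowly varying trace depends much less on the temporal envelope of the signal, which is consistent with the observation above that predictions based on long-term loudness showed smaller level differences for the speech-like signals.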
Technical sounds
Figure 5 shows the measured level differences compared to predictions of two stationary models (ANSI, 2007; DIN, 1991; left panel) and four dynamic models (right panel). Predictions of the two stationary models were generally similar. For three of the eight signals (machine gun, jet non-linear, and helicopter), predictions were reasonably accurate. For all other signals, however, predicted level differences differed largely from the measured data. With one exception (hammer), the loudness of these test signals was considerably underestimated, resulting in too high values of predicted level differences (e.g., ratch wheel). Generally, the same trends were observed for the time-varying loudness models. The only exception was observed for snare drum, for which level differences were underestimated by time-varying models and overestimated by stationary models. None of the employed models could predict level differences at equal loudness for all test signals. In particular, ratch wheel, plow, snare drum, and low-noise noise turned out to result in large discrepancies.

FIGURE 4: Measured and predicted level differences at equal loudness for speech-like stimuli.

FIGURE 5: Measured and predicted level differences at equal loudness for technical stimuli.

CONCLUSIONS
The data and model predictions presented in this contribution lead to the following conclusions:

Loudness of speech is largely determined by the signal's long-term spectrum. In particular, speech-like temporal modulations have no effect on loudness. Consequently, models based on short-term loudness cannot predict the data. Instead, slower time constants, as employed, e.g., in the TVL (Glasberg and Moore, 2002), are required.

For the investigated set of technical sounds, large level differences at equal loudness were observed. None of the employed loudness models could predict the loudness of all signals. A simple adjustment of the models' time constants cannot overcome the discrepancies because, for some of the signals, loudness is underestimated while, for other signals, loudness is overestimated.

Apart from these unexplained effects observed for strongly time-varying signals, predictions also deviate from experimental data when a narrowband test signal is compared to a broadband reference, in that the loudness of the narrowband signal is underestimated. This is in line with data of Hots et al. (2012).

Categorical loudness scaling is less reliable than loudness matching for the measurement of level differences at equal loudness, as indicated by the larger observed interindividual variability.

ACKNOWLEDGMENTS
This study was supported by the Ministry for Science and Culture of Lower Saxony, Germany, by funding of Fraunhofer Project Group Hearing, Speech and Audio Technology.

REFERENCES
ANSI (2007). ANSI S3.4-2007 Procedure for the computation of loudness of steady sounds, Technical Report, Standards Secretariat, Acoustical Society of America, Melville, USA.
Brand, T. and Hohmann, V. (2002). An adaptive procedure for categorical loudness scaling, The Journal of the Acoustical Society of America 112, 1597-1604.
Byrne, D., Dillon, H., Tran, K., Arlinger, S., Wilbraham, K., Cox, R., Hagerman, B., Hetu, R., Kei, J., Lui, C., Kießling, J., Nasser Kotby, M., Nasser, N. H. A., Kholy, W. A. H., Nakanishi, Y., Oyer, H., Powell, R., Stephens, D., Meredith, R., Sirimanna, T., Tavartkiladze, G., Frolenkov, G. I., Westerman, S., and Ludvigsen, C. (1994). An international comparison of long-term average speech spectra, The Journal of the Acoustical Society of America 96, 2108-2120.
Chalupper, J. and Fastl, H. (2002). Dynamic loudness model (DLM) for normal and hearing-impaired listeners, Acta Acustica united with Acustica 88, 378-386.
DIN (1991). DIN 45631 Procedure for calculating loudness level and loudness, Technical Report, Deutsches Institut für Normung, Berlin, Germany.
DIN (2010). DIN 45631/A1 Calculation of loudness level and loudness from the sound spectrum - Zwicker method - Amendment 1: Calculation of the loudness of time-variant sound, Technical Report, Deutsches Institut für Normung, Berlin, Germany.
Dreschler, W., Verschuure, H., Ludvigsen, C., and Westermann, S. (2001). ICRA Noises: Artificial noise signals with speech-like spectral and temporal properties for hearing instrument assessment, Audiology 40, 148-157.
Gee, K. L., Swift, H. S., Sparrow, V. W., Plotkin, K. J., and Downing, M. J. (2007). On the potential limitations of conventional sound metrics in quantifying perception of nonlinearly propagated noise, The Journal of the Acoustical Society of America 121, EL1-EL7.

Glasberg, B. R. and Moore, B. C. J. (2002). A model of loudness applicable to time-varying sounds, Journal of the Audio Engineering Society 50, 331-341.
Holube, I. (2012). Speech intelligibility in fluctuating maskers, in Speech Perception and Auditory Disorders, 3rd International Symposium on Auditory and Audiological Research, ISAAR 2012, edited by T. Dau, M. L. Jepsen, T. Poulsen, and J. C. Dalsga.
Hots, J., Rennies, J., and Verhey, J. L. (2012). Loudness of sounds with a subcritical bandwidth, Journal of the Acoustical Society of America 131, 3517.
Kohlrausch, A., Fassel, R., van der Heijden, M., Kortekaas, R., van de Par, S., Oxenham, A. J., and Püschel, D. (1997). Detection of tones in low-noise noise: Further evidence for the role of envelope fluctuations, Acustica united with Acta Acustica 83, 659-669.
Moore, B. C. J. and Glasberg, B. R. (1997). A model for the prediction of thresholds, loudness, and partial loudness, Journal of the Audio Engineering Society 45, 224-240.
Port, E. (1963). Über die Lautstärke einzelner kurzer Schallimpulse (On the loudness of single short sound pulses), Acustica 13, 212-223.
Poulsen, T. (1981). Loudness of tone pulses in a free field, The Journal of the Acoustical Society of America 69, 1786-1790.
Rennies, J., Holube, I., and Verhey, J. L. (2013). Loudness of speech and speech-like signals, Acta Acustica united with Acustica, in press.
Rennies, J., Verhey, J. L., Chalupper, J., and Fastl, H. (2009). Modeling temporal effects of spectral loudness summation, Acta Acustica united with Acustica 95, 1112-1122.
Verhey, J. L. (1999). Psychoacoustics of spectro-temporal effects in masking and loudness perception, PhD thesis, University of Oldenburg, Germany.
Wagener, K., Brand, T., and Kollmeier, B. (1999). Entwicklung und Evaluation eines Satztests für die deutsche Sprache Teil II: Optimierung des Oldenburger Satztests (Development and evaluation of a German sentence test Part II: Optimization of the Oldenburg sentence test), Zeitschrift für Audiologie 38, 44-56.
Zokoll, M., Hochmuth, S., Fidan, D., Wagener, K., Ergenc, I., and Kollmeier, B. (2012). Sprachverständlichkeitstests für die türkische Sprache (Speech intelligibility tests for the Turkish language), in Deutsche Gesellschaft für Audiologie, CD ROM (Erlangen, Germany).
Zwicker, E., Flottorp, G., and Stevens, S. S. (1957). Critical bandwidth in loudness summation, The Journal of the Acoustical Society of America 29, 548-557.