AUD 3 Speech Science
Dr. Peter Assmann
Spring semester 2

Role of Pitch Information
- Pitch contour is the primary cue for tone recognition.
- Tonal languages rely on pitch level and pitch differences to convey lexical meanings within syllables.
- Pitch helps to segregate auditory components that come from different sound sources.

Pitch and consonant voicing
Voice pitch is higher following a voiceless consonant than following a voiced consonant. Listeners perceive these small changes: voicing judgments are influenced by the F0 of the following vowel.
- "a d a" (lower F0)
- "a t a" (higher F0)

Size variation in natural speech
Adults' voices vs. children's voices: fundamental frequency and formant frequencies.
[Figure: F0 as a function of age and sex. Axes: age (years) vs. fundamental frequency (Hz).]
[Figure: formant frequencies (FFs) as a function of age and sex. Axes: age (years) vs. geometric mean formant frequency (Hz).]

Pitch and vowel identification
There is a systematic relationship between F0 and formant frequencies across voices: low-pitched voices tend to have lower formants than high-pitched voices, and vice versa.
[Figure: relationship between F0 and FFs. Axes: log mean F0 (Hz) vs. log mean FF (Hz).]
Frequency-shifted speech is more intelligible, and is perceived as more natural, when the normal co-variation of F0 and formant frequencies is preserved, even when the frequency shifts approach or exceed the range found in human speech.
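
As a rough illustration of the F0-formant relationship described above, the following Python sketch computes a geometric mean formant frequency per talker and fits a straight line to log mean FF against log mean F0 by ordinary least squares. The talker values are hypothetical placeholders chosen for illustration, not data from the lecture, and the function names are assumptions.

```python
import math

def geometric_mean(values):
    """Geometric mean of positive numbers, computed in the log domain."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def fit_log_log(f0_values, ff_values):
    """Least-squares line: log(FF) = intercept + slope * log(F0)."""
    xs = [math.log(f) for f in f0_values]
    ys = [math.log(f) for f in ff_values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical talkers: (mean F0 in Hz, [F1, F2, F3] in Hz). Illustrative only.
talkers = [
    (110.0, [500.0, 1500.0, 2500.0]),
    (210.0, [590.0, 1770.0, 2950.0]),
    (260.0, [650.0, 1950.0, 3250.0]),
]
f0s = [f0 for f0, _ in talkers]
ffs = [geometric_mean(formants) for _, formants in talkers]
slope, intercept = fit_log_log(f0s, ffs)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

A positive slope in this fit corresponds to the pattern on the slide: voices with higher F0 tend to have higher formant frequencies. Working in the log domain treats equal frequency ratios as equal steps.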

[Figure: comparison of vowel identification accuracy (red circles) and sentence recognition (blue circles). Axes: geometric mean F1-F2-F3 (Hz) vs. identification accuracy (%).]
[Figure: scatterplot of vowel identification accuracy and HINT sentence recognition.]
[Figure, three panels: geometric mean formant frequency (Hz) vs. geometric mean F0 (Hz), medians of vowels per talker, each with a linear fit of FF on F0. Panels: original (unscaled) voices; age-transformed (gender preserved); age-transformed (gender swapped).]

Cocktail Party Effect
Colin Cherry (1953) coined the term "cocktail party effect" to describe the ability of listeners to attend to a single talker in a mixture of conversations and other background noises.

Cocktail Party Phenomenon
Cherry's experiments involved listening to two different messages presented to one or both ears, on the same pitch or on different pitches, spoken by two talkers of the same gender or of different genders.
Cherry concluded that listeners rely on several cues to follow a conversation in the presence of competing voices, including:
- Spatial separation
- Pitch differences
- Gender differences

Auditory scene analysis
- The sound that reaches the listener's eardrum is often a mixture of different sources.
- Acoustic signals originating from different sound sources combine additively.
- Unlike in vision, the concept of occlusion is hard to define in audition: sounds overlap but also combine in complex ways.
Auditory scene analysis (ASA) is the process by which the auditory system organizes complex mixtures of sound. ASA involves grouping processes, in which sound components that are likely to have come from the same environmental source are linked to form a single perceptual unit.

Computational auditory scene analysis
Reviewed by Cooke and Ellis (2001).
- Human listeners are good at separating mixtures of sounds, as reflected in speech communication and in listening to music in complex listening environments (cocktail parties).
- Attempts to reproduce this separation process using computational models have had limited success (a hard problem!).

Auditory scene analysis
Demonstration: when a sequence of tones made up of alternating low and high notes is played to listeners at a slow rate, they hear low and high tones alternating. At high rates they hear two separate streams, one high and one low.
Bregman used the term "stream segregation" to describe the auditory processes that group together sounds that share common features and segregate them from sounds that differ.
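
The alternating-tone demonstration described above can be sketched in a few lines of Python. This is a minimal illustration, not the stimulus used in class: the tone frequencies (400 and 1000 Hz), the tone durations, and the sample rate are all assumed values, and the function names are hypothetical.

```python
import math

def alternating_tone_schedule(low_hz, high_hz, tone_s, n_tones):
    """Return (onset_time_s, frequency_hz) pairs alternating low and high."""
    return [
        (i * tone_s, low_hz if i % 2 == 0 else high_hz)
        for i in range(n_tones)
    ]

def synthesize(schedule, tone_s, sample_rate=16000):
    """Render each tone as a plain sine segment (samples in [-1, 1]).

    No onset/offset ramps are applied, so clicks may be audible at the
    tone boundaries if the samples are played back.
    """
    n_per_tone = int(round(tone_s * sample_rate))
    samples = []
    for _, freq in schedule:
        for n in range(n_per_tone):
            samples.append(math.sin(2 * math.pi * freq * n / sample_rate))
    return samples

# Assumed rates: 0.40 s per tone (slow) and 0.10 s per tone (fast).
slow = alternating_tone_schedule(400.0, 1000.0, tone_s=0.40, n_tones=8)
fast = alternating_tone_schedule(400.0, 1000.0, tone_s=0.10, n_tones=8)

# At the fast rate listeners tend to hear these as two separate streams.
low_stream = [t for t, f in fast if f == 400.0]
high_stream = [t for t, f in fast if f == 1000.0]
```

The only difference between the two sequences is the tone duration, which sets the presentation rate. The rates at which the sequence actually splits into two streams for a given listener are not specified here.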

Stream segregation
- At slow rates we hear alternating low and high notes. At faster rates we hear two separate streams of low and high notes.
- When the sequence splits into two separate streams, it is difficult to attend to the low and high streams at the same time.
- It is much easier to listen for a subset of 3 tones from the sequence when they belong to the same stream.
[Figure: sets A and B compared against a standard (a repeating 3-tone cycle). Left panels: within-stream. Right panels: across-stream. Which set (A or B) preserves the standard more effectively?]

Auditory grouping principles
- Figure-ground phenomenon
- Proximity
- Good continuation
- Closure
- Common fate
- Old-plus-new heuristic
Grouping and segregation are complementary processes.
- Sequential grouping
- Simultaneous grouping

Grouping by timbre
Tones that deviate from the rising/falling pattern pop out of the sequence.

Grouping by onset
Harmonics of speech sounds or music that start and stop at the same time are grouped together (Gestalt law of common fate).
Rasch (1) showed that it is easier to distinguish two tones from one another when the onset of the first precedes the onset of the second by a short time interval.
[Figure: onsets of Tone 1 and Tone 2 along a time axis.]
Darwin (11) showed that a harmonic which starts or stops before the remaining harmonics of a vowel is (partially) excluded from the vowel percept. When the onset of harmonic # is earlier than that of any of the remaining harmonics, vowel quality shifts in the direction of lower F1 (as if the harmonic had been removed).
[Figure: synchronous harmonics vs. asynchronous harmonics.]
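
To make the onset manipulation concrete, here is a minimal Python sketch that builds a harmonic complex in which one harmonic starts earlier than the rest. It is a simplification under stated assumptions: the harmonics have equal amplitudes (a real vowel would have a formant-shaped spectrum), and the F0, the number of harmonics, the leading harmonic, and the lead time are hypothetical values, not those of the experiments cited above.

```python
import math

def harmonic_complex(f0_hz, n_harmonics, duration_s, lead_harmonic, lead_s,
                     sample_rate=16000):
    """Sum of equal-amplitude harmonics; one harmonic starts lead_s early.

    Time zero is the onset of the leading harmonic. All other harmonics
    start lead_s later and then last for duration_s.
    """
    n_total = int(round((duration_s + lead_s) * sample_rate))
    n_lead = int(round(lead_s * sample_rate))
    samples = []
    for n in range(n_total):
        t = n / sample_rate
        value = 0.0
        for h in range(1, n_harmonics + 1):
            started = (h == lead_harmonic) or (n >= n_lead)
            if started:
                value += math.sin(2 * math.pi * h * f0_hz * t)
        samples.append(value / n_harmonics)  # keep samples within [-1, 1]
    return samples

# Assumed parameters: 125 Hz F0, 10 harmonics, 4th harmonic leads by 30 ms.
synchronous = harmonic_complex(125.0, 10, 0.2, lead_harmonic=4, lead_s=0.0)
asynchronous = harmonic_complex(125.0, 10, 0.2, lead_harmonic=4, lead_s=0.03)
```

During the first 30 ms of the asynchronous version only the leading harmonic is present. After that point the two versions contain the same set of harmonics.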

Schema-based grouping
- When one harmonic is gradually ramped up in level, we hear a slight, gradual change in vowel quality.
- When the harmonic is augmented by > dB relative to its level in the original vowel, we hear a slight change in vowel quality and an extra, superimposed tone.
[Figure: uniform-amplitude harmonics with ramped harmonic #; normal-amplitude harmonics with harmonic # augmented by dB (two levels).]

Principle of good continuation
When sounds are interrupted by silence (e.g., signal dropouts or faulty communication lines) or by interfering sounds, the sound is heard as continuous: the auditory system fills in the missing pieces.

Periodicity and noise
Hypothesis: periodicity of speech contributes to robustness.
- Harmonicity in the frequency domain
- Across-frequency grouping of spectral features
- Unvoiced sounds (e.g., whispered speech) are more susceptible to masking and interference by competing sounds.

F0 and voice separation
When two people speak at the same time, it is often easier to understand what they say if the pitches of their voices differ, for example if one voice is male and the other is female.
Hypothesis 1: voice separation becomes easier and intelligibility improves as the pitch (F0) difference between the voices is increased.
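
Pitch differences between voices are usually expressed in semitones, where 12 semitones make one octave (a doubling of F0). The short Python sketch below converts a pair of F0 values into a semitone difference. The function name and the example F0 values are assumptions for illustration.

```python
import math

def semitone_difference(f0_a_hz, f0_b_hz):
    """Signed interval from voice A to voice B in semitones (12 per octave)."""
    return 12.0 * math.log2(f0_b_hz / f0_a_hz)

# One octave apart: 100 Hz vs. 200 Hz.
print(semitone_difference(100.0, 200.0))

# Hypothetical pair of voices with mean F0s of 120 Hz and 220 Hz.
print(round(semitone_difference(120.0, 220.0), 2))
```

Because the measure is logarithmic, a fixed difference in Hz corresponds to a smaller semitone interval at high F0s than at low F0s.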

F0 and voice separation
Voice pitch is rarely constant in natural speech; it changes over time (the melody of speech, or prosody). Time variation in voice pitch may help listeners to track a target voice in a mixture of voices.
Hypothesis 2: voice separation is easier when the natural variation in pitch (F0) is present, and becomes more difficult when the pitch is held constant (monotone).
A high-quality speech vocoder was used to construct pairs of sentences on different F0s. The F0 difference between the two voices was manipulated (, 1, 2, 4, or semitones). F0 was either constant or variable (natural pitch contour).
[Figure: example stimuli, frequency axis in Hz.]

Results
[Figure: results for the 1st F0 and 2nd F0, frequency axis in Hz.]
- Significant improvement with increasing F0 difference.
- No effect of F0 modulation.
- No interaction of F0 difference x F0 modulation.
- Marginally higher scores for intoned sentences at and 1 semitones may stem from momentary differences in F0 between the sentences.
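
The two stimulus manipulations described above (an F0 difference in semitones, and a constant versus natural F0 contour) can be sketched on an F0 track as follows. This is a hedged illustration, not the vocoder procedure used in the study: the contour values are hypothetical, and setting the monotone F0 to the geometric mean of the contour is an assumption, since the slides do not say how the constant value was chosen.

```python
import math

def shift_semitones(f0_contour_hz, semitones):
    """Scale every F0 value by the ratio corresponding to `semitones`."""
    factor = 2.0 ** (semitones / 12.0)
    return [f * factor for f in f0_contour_hz]

def flatten(f0_contour_hz):
    """Monotone version: constant F0 at the contour's geometric mean.

    The choice of the geometric mean is an assumption made for this sketch.
    """
    mean_log = sum(math.log(f) for f in f0_contour_hz) / len(f0_contour_hz)
    return [math.exp(mean_log)] * len(f0_contour_hz)

# Hypothetical F0 track (Hz), one value per analysis frame.
natural = [118.0, 124.0, 131.0, 127.0, 115.0, 108.0, 102.0]

second_voice = shift_semitones(natural, 4)           # 4 semitones higher
monotone = flatten(natural)                          # constant-F0 version
monotone_second = shift_semitones(monotone, 4)       # constant, 4 st higher
```

Shifting by a fixed number of semitones multiplies the whole contour by a constant, so the shape of the natural contour is preserved on a log-frequency scale.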

Brungart et al. (2001)
[Figure, two panels: 2-talker correct responses (%) as a function of target-to-masker ratio (dB). Legend: modulated noise; same talker; different talker, same or different sex.]

Midterm take-home exam
In a recent review of the literature on clear speech, Smiljanić and Bradlow (2009) describe a speaking style called clear speech and its effect on intelligibility in different populations of listeners. Your assignment is to review the literature on clear speech, using both primary and secondary sources, to summarize what is known about clear speech and what these findings imply for speech production and perception.
Identify a set of organizing principles or themes that emerge from current research and use these as section headers for your paper. There is no set length for the paper, but - (double-spaced) pages is a reasonable target.
Two review papers to get started:
1. Smiljanić, R., Bradlow, A. R. (2009). Speaking and Hearing Clearly: Talker and Listener Factors in Speaking Style Changes. Lang Linguist Compass. 3(1): 236-264.
2. Uchanski, R. M. Clear speech. In: Pisoni, D. B.; Remez, R., editors. The handbook of speech perception. Malden, MA/Oxford, UK: Blackwell; 2. p. 2-3.