Speech Communication xxx (2005) xxx-xxx

Recognizing student emotions and attitudes on the basis of utterances in spoken tutoring dialogues with both human and computer tutors

Diane J. Litman a,*, Kate Forbes-Riley b
a University of Pittsburgh, Department of Computer Science and Learning Research and Development Center, Pittsburgh, PA 15260, USA
b University of Pittsburgh, Learning Research and Development Center, Pittsburgh, PA 15260, USA

Received 27 July 2004; received in revised form 13 September 2005; accepted 21 September 2005

Abstract

While human tutors respond to both what a student says and to how the student says it, most tutorial dialogue systems cannot detect the student emotions and attitudes underlying an utterance. We present an empirical study investigating the feasibility of recognizing student state in two corpora of spoken tutoring dialogues, one with a human tutor and one with a computer tutor. We first annotate student turns for negative, neutral and positive student states in both corpora. We then automatically extract acoustic-prosodic features from the student speech, and lexical items from the transcribed or recognized speech. We compare the results of machine learning experiments using these features alone, in combination, and with student- and task-dependent features, to predict student states. We also compare our results across human-human and human-computer spoken tutoring dialogues. Our results show significant improvements in prediction accuracy over relevant baselines, and provide a first step towards enhancing our intelligent tutoring spoken dialogue system to automatically recognize and adapt to student states.
© 2005 Elsevier B.V. All rights reserved.

Keywords: Emotional speech; Predicting user state via machine learning; Prosody; Empirical study relevant to adaptive spoken dialogue systems; Tutorial dialogue systems

1. Introduction

This paper investigates the automatic recognition of student emotions and attitudes in both human-human and human-computer spoken tutoring dialogues, on the basis of acoustic-prosodic and lexical information extractable from utterances. In recent years, the development of computational tutorial dialogue systems has become more and more prevalent (Aleven and Rose, 2003; Rose and Freedman, 2000; Rose and Aleven, 2002), as one method of attempting to close the current performance gap between human and computer tutors; recent experiments with such systems (e.g., Graesser et al., 2001b) are starting to yield promising empirical results.

* Corresponding author. E-mail addresses: litman@cs.pitt.edu (D.J. Litman), forbesk@pitt.edu (K. Forbes-Riley).

Motivated by connections between learning and student emotional state (Coles, 1999; Izard; Masters et al., 1979; Nasby and Yando; Potts et al., 1986; Seipp, 1991), another proposed method for closing the performance gap with human tutors has been to incorporate affective reasoning into computer tutoring systems, independently of whether or not the tutor is dialogue-based (Conati et al., 2003a; Kort et al., 2001; Bhatt et al., 2004). Recently, some preliminary results with computer tutors have been presented to support this line of research. Aist et al. (2002) have shown that adding human-provided emotional scaffolding to an automated reading tutor increases student persistence, while Craig and Graesser (2003) have found a significant relationship between students' confusion and learning during interactions with a mixed-initiative dialogue tutoring system.[1] Our long-term goal is to merge these lines of dialogue and affective tutoring research, by enhancing our intelligent tutoring spoken dialogue system to automatically recognize and adapt to student emotions and attitudes, and to investigate whether this improves learning and other measures of performance. The development of the adaptation component requires accurate emotion recognition; this paper presents results regarding this first step of our larger agenda: building an emotion recognition component.

Currently, most intelligent tutoring dialogue systems do not attempt to recognize student emotions and attitudes, and furthermore are text-based (Aleven et al., 2001; Evens et al., 2001; VanLehn et al.; Zinn et al., 2002), which may limit their success at emotion prediction. Speech supplies a rich source of information about a speaker's emotional state, and research in the area of emotional speech has already shown that acoustic and prosodic features can be extracted from the speech signal and used to develop predictive models of emotions (Cowie et al., 2001; ten Bosch, 2003; Pantic and Rothkrantz, 2003; Scherer, 2003). Much of this research has used databases of speech read by actors or native speakers as training data, often with semantically neutral content (Oudeyer, 2002; Polzin and Waibel, 1998; Liscombe et al., 2003). Although analyses of the acoustic-prosodic features associated with acted archetypal emotions support some correlations between specific features and emotions (e.g., lower average pitch and speaking rate for sad speech (ten Bosch, 2003)), these results generally transfer poorly to real applications (Cowie and Cornelius, 2003; Batliner et al., 2003). As a result, recent work motivated by spoken dialogue applications has started to use naturally occurring speech to train emotion predictors (Shafran et al., 2003; Batliner et al., 2003; Narayanan, 2002; Ang et al., 2002; Lee et al., 2002; Litman et al., 2001; Batliner et al., 2000; Lee et al., 2001; Devillers et al., 2003). However, within emotion research using naturally occurring data, both the range of emotions presented and the features that correlate with them have varied depending on the application domain (cf. Shafran et al., 2003; Narayanan, 2002; Ang et al., 2002; Lee et al., 2002; Batliner et al., 2000; Devillers et al., 2003).

Thus, more empirical work is needed to explore whether and how similar techniques can be used effectively to model student states in spoken dialogue tutoring systems. In addition, past research using naturally occurring speech has studied only human-human (Devillers et al., 2003), human-computer (Shafran et al., 2003; Lee et al., 2001; Lee et al., 2002; Narayanan, 2002; Ang et al., 2002), or wizard-of-oz (Batliner et al., 2000; Batliner et al., 2003; Narayanan, 2002) dialogue data. Just as previous work has demonstrated that results based on acted or read speech transfer poorly to spontaneous speech, more empirical work is needed to explore whether and how results regarding emotion prediction transfer across different types of naturally occurring spoken dialogue data, i.e. spoken dialogues between humans versus spoken dialogues between humans and computers, and/or spoken dialogues from different application domains.

In this paper, we examine the relative utility of the acoustic-prosodic and lexical information in student utterances, both with and without student- and task-dependent information, for recognizing student emotions and attitudes in spoken tutoring dialogues; we also examine the impact of using human transcriptions versus noisier system output for obtaining such information. Our methodology builds on and generalizes the results of prior research from the area of spoken dialogue, while applying them to the new domain of naturally occurring tutoring dialogues (in the domain of qualitative physics).

[1] We have also found a correlation between the ratio of negative/neutral student states and learning gains in our intelligent tutoring spoken dialogue data (to be described below), although these results are very preliminary.

Our work is also novel in replicating our analyses across two comparable spoken dialogue corpora: one with a computer tutor, and the other with a human tutor performing the same task as our computer system. Although these corpora were collected under comparable experimental conditions, they differ with respect to many characteristics, such as utterance length and student initiative. Given the current limitations of both speech and natural language processing technologies, computer tutors are far less flexible than human tutors, and also make more errors. The use of human tutors thus represents an ideal computer system, and thereby provides a benchmark for estimating the performance of our emotion recognition methods, at least with respect to speech and natural language processing performance.

In our experiments, we first annotate student turns in both of our spoken dialogue tutoring corpora for negative, neutral, and positive emotions and attitudes. We then create two datasets for each corpus: an Agreed dataset containing only those student turns whose annotations were originally agreed on by the annotators, and a Consensus dataset containing all annotated student turns, where original disagreements were given a consensus label. These datasets are summarized in Table 4. We then automatically extract acoustic-prosodic features from the speech signal of our annotated student turns, and lexical items from the transcribed or recognized speech, and perform a variety of machine learning experiments to predict our emotion categorizations using different feature set combinations. Overall, our results show that by using acoustic-prosodic features alone, or in combination with identifier features identifying specific subjects and tutoring problems, or in combination with lexical information, we can significantly improve over baseline (majority class) performance figures for emotion prediction. Our highest prediction accuracies are obtained by combining multiple feature types and by predicting only those annotated student turns that both annotators agreed on; these results are summarized later for our human-human corpus and, in Table 19, for our human-computer corpus. However, simpler models containing only a subset of features (or feature types) work comparably in many experiments, and these simpler models often have the advantage in terms of ease of implementation and/or domain-independence. While many of our observations generalized across the human-human and human-computer dialogues, we also find interesting differences between recognizing emotion in our two corpora, and also as compared to prior studies in other domains. In general, lexical features yielded higher predictive utility than acoustic-prosodic features. Within acoustic-prosodic features, there was a trend for temporal features to have the highest predictive utility, followed by energy features and lastly, pitch features. However, the usefulness of acoustic-prosodic features varied across experiments and corpora; indeed, across prior research as a whole, the usefulness of particular acoustic-prosodic features appears to be often domain-dependent.

Similarly, identifier features, whose use is limited to domains such as ours where there is a limited problem set and students reuse the tutoring system repeatedly, were found to have higher predictive utility in our human-computer corpus as compared to our human-human corpus. In sum, our recognition results provide an empirical basis for the next phase of our research, which will be to enhance our spoken dialogue tutoring system to automatically recognize and ultimately to adapt to student states.

Section 2 describes ITSPOKE, our intelligent tutoring spoken dialogue system, and the corpus it produces, as well as a human-human spoken tutoring corpus that corresponds to the human-computer corpus produced by ITSPOKE. Section 3 describes our annotation scheme for manually labeling student emotions and attitudes, and evaluates inter-annotator agreement when this scheme is used to annotate student states in dialogues from both our human-human and human-computer corpora. Section 4 discusses how acoustic and prosodic features available in real-time to ITSPOKE are computed from our dialogues. Section 5 then presents our machine learning experiments in automatic emotion recognition, analyzing the predictive performance of acoustic-prosodic features alone or in combination, both with and without subject- and task-dependent information. Section 6 investigates the impact of both adding a lexical feature representing the transcription of the student turn and, for the human-computer dialogues, using the noisy output of the speech recognizer rather than the actual transcription. Finally, Section 7 discusses related research, while Section 8 summarizes our results and describes our current and future directions.

2. Spoken dialogue tutoring corpora

2.1. Common aspects of the corpora

Our data for this paper come from spoken interactions between student and tutor, through which students learn to solve qualitative physics problems, i.e. thought-provoking "explain" or "why"-type physics problems that can be answered without doing any mathematics. We have collected two corpora of these spoken tutoring dialogues, which are distinguished according to whether the tutor is a human or a computer.

In these spoken tutoring corpora, dialogue interaction between student and tutor is mediated via a web interface, supplemented with a high-quality audio link. An example screenshot of this web interface, generated during an interaction between a student and the computer tutor, is shown in Fig. 1. The qualitative physics problem (problem 58) is shown in the upper right box. The student begins by typing an essay answer to this problem in the middle right box. When finished with the essay, the student clicks the SUBMIT button.[2] The tutor then analyzes the essay and engages the student in a spoken natural language dialogue to provide feedback and correct misconceptions in the essay, and to elicit more complete explanations. The middle left box in Fig. 1 is used during human-computer tutoring to record the dialogue history. This box remains empty during human-human tutoring, because both the student and tutor utterances would require manual transcription before they could be displayed. In the human-computer tutoring, in contrast, the speech recognition and speech synthesis components of the computer tutor can be used to provide the transcriptions. After the dialogue between tutor and student is completed, the student revises the essay, thereby ending the tutoring for that physics problem or causing another round of tutoring/essay revision.

The experimental procedure for collecting both our spoken tutoring corpora is as follows[3]: (1) students are given a pre-test measuring their knowledge of physics, (2) students are asked to read through a small document of background material,[4] (3) students use the web and voice interface to work through a set of up to 10 training physics problems with the (human or computer) tutor, and (4) students are given a post-test that is similar to the pre-test. The experiment typically takes no more than 7 h per student, and is performed in 1-2 sessions. Students are University of Pittsburgh students who have never taken a college-level physics course, and who are native speakers of American English.

2.2. The human-human spoken dialogue tutoring corpus

Our human-human spoken dialogue tutoring corpus contains 128 transcribed dialogues (physics problems) from 14 different students, collected beginning in Fall 2002. One human tutor participated. The student and the human tutor were separated by a partition, and spoke to each other through head-mounted microphones. Each participant's speech was digitally recorded on a separate channel. Transcription and turn-segmentation of the student and tutor speech were then done by a paid transcriber. The transcriber added a turn boundary when: (1) the speaker stopped speaking and the other party in the dialogue began to speak, (2) the speaker asked a question and stopped speaking to wait for an answer, or (3) the other party in the dialogue interrupted the speaker and the speaker paused to allow the other party to speak. An emotion-annotated (Section 3) excerpt from our human-human tutoring corpus is shown in Fig. 2. In the human-human corpus, interruptions and overlapping speech are common; turns ending in "-" (as in TUTOR 6, Fig. 2) indicate when speech overlaps with the following turn, and other punctuation has been added to the transcriptions for readability.

2.3. The human-computer spoken dialogue tutoring corpus

Our human-computer spoken dialogue tutoring corpus contains 100 dialogues (physics problems) from 20 students, collected beginning in Fall 2003.

[2] The Tell Tutor box is used for typed student login and logout.
[3] Our spoken tutoring corpora were collected as part of a wider evaluation comparing student learning across speech-based and text-based human-human and human-computer tutoring conditions (Litman et al., 2004).
[4] In the computer tutoring experiment, the pre-test was moved to after the background reading, to allow us to measure learning gains caused by the experimental manipulation without confusing them with gains caused by the background reading.

Fig. 1. Screenshot during human-computer spoken tutoring dialogue.
Fig. 2. Annotated excerpt from human-human spoken tutoring corpus.

Our computer tutor is called ITSPOKE (Intelligent Tutoring SPOKEn dialogue system) (Litman and Silliman, 2004). ITSPOKE uses as its back-end the text-based Why2-Atlas dialogue tutoring system (VanLehn et al., 2002), which handles syntactic and semantic analysis (Rosé, 2000), discourse and domain level processing (Jordan and VanLehn, 2002; Jordan et al., 2003), and finite-state dialogue management (Rosé et al., 2001).

To analyze the typed student essay, the Why2-Atlas back-end first parses the student essay into propositional representations, in order to find useful dialogue topics. It uses three different approaches (symbolic, statistical and hybrid) competitively to create a representation for each sentence, then resolves temporal and nominal anaphora and constructs proofs using abductive reasoning (Jordan et al., 2004).

During the subsequent dialogue, student speech is digitally recorded from head-mounted microphone input. Barge-ins and overlaps are not currently permitted.[5] The student speech is sent to the Sphinx2 speech recognizer (Huang et al., 1993), whose stochastic language models have a vocabulary of 1240 words and are trained with 7720 student utterances from evaluations of Why2-Atlas and from pilot studies of ITSPOKE. Transcription (speech recognition) and turn-segmentation is done automatically in ITSPOKE. However, because speech recognition is imperfect, the human-computer data is also manually transcribed, for comparison. Sphinx2's most probable transcription (recognition output) is sent to the Why2-Atlas back-end for natural language understanding. The dialogue is managed by a finite-state dialogue manager, where nodes correspond to tutor turns, and arcs to student turns. Why2-Atlas' natural language understanding (NLU) component associates a semantic grammar with each tutor question (i.e., with each node in the dialogue finite-state machine); grammars across questions may share rules. The categories in the grammar correspond to the expected responses for the question (i.e., to the arcs exiting the question node in the finite-state machine), and represent both correct answers and typical student misconceptions (VanLehn et al., 2002).

Fig. 3. Annotated excerpt from human-computer spoken tutoring corpus.

[5] Although not yet evaluated, our next version of ITSPOKE supports barge-in, and thus allows the student to interrupt ITSPOKE when it is speaking, e.g., when it is giving a long explanation.
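The grammar-plus-finite-state design just described can be made concrete with a small sketch. The following is only a toy illustration under assumed simplifications, not the Why2-Atlas implementation: the node names, the non-"downward" grammar entries, and the substring matching are hypothetical, and real semantic grammars are far richer than keyword lists.

```python
# Toy illustration of the design described above: each tutor question is a node in a
# finite-state machine, a semantic grammar maps expected concepts to surface variants,
# and the concepts found in the recognized student turn select the outgoing arc.
# Node names, the "upward" entries, and the matching logic are hypothetical.

GRAMMAR = {
    "direction-question": {
        "downward": ["down", "downward", "downwards", "towards earth", "is it downwards"],
        "upward": ["up", "upward", "upwards", "away from earth"],
    },
}

TRANSITIONS = {
    "direction-question": {
        "downward": "ask-about-magnitude",   # expected correct answer
        "upward": "remediate-direction",     # typical misconception
        None: "rejection-prompt",            # nothing expected was recognized
    },
}

def nlu(node, recognized):
    """Return the subset of expected concepts found in the recognized utterance."""
    text = recognized.lower()
    return {concept for concept, phrases in GRAMMAR.get(node, {}).items()
            if any(phrase in text for phrase in phrases)}

def next_node(node, recognized):
    """Follow the arc for a recognized concept, else fall back to a rejection prompt."""
    arcs = TRANSITIONS[node]
    for concept in nlu(node, recognized):
        if concept in arcs:
            return arcs[concept]
    return arcs[None]

print(nlu("direction-question", "um I think it is downwards"))  # {'downward'}
print(next_node("direction-question", "towards earth"))         # ask-about-magnitude
```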

Given a student's utterance, the output of the NLU component is thus a subset of the semantic concepts that were expected as answers to the tutor's prior question, and that were found when parsing the student's utterance. For instance, the semantic concept downward is used in many of the semantic grammars, and would be the semantic output for a variety of utterances such as "downwards", "towards earth", "is it downwards", "down", etc.

The text response produced by Why2-Atlas (i.e., the next node in the finite-state machine) is then sent to the Cepstral text-to-speech system[6] and played to the student through the headphones. After each system prompt or student utterance is spoken, the system prompt, or the system's understanding of the student's response (i.e., the output of the speech recognizer), respectively, is added to the dialogue history. At the time the screenshot in Fig. 1 was generated, for example, the student had just said "free fall" (in this case the utterance was correctly recognized).

An emotion-annotated (Section 3) dialogue excerpt from our human-computer corpus is shown in Fig. 3. The excerpt shows both what the student said and what ITSPOKE recognized (the ASR annotations). As shown, the output of the automatic speech recognizer sometimes differed from what the student actually said. When ITSPOKE was not confident of what it thought the student said, it generated a rejection prompt and asked the student to repeat. ITSPOKE also misrecognized utterances: when ITSPOKE heard something different than what the student said (as with the last student turn) but was confident in its hypothesis, it proceeded as if it had heard correctly. While the ITSPOKE word error rate in this corpus was 31.2%, natural language understanding based on speech recognition (i.e., the recognition of semantic concepts instead of actual words) is the same as that based on perfect transcription 92.4% of the time.[7] The accuracy of recognizing semantic concepts is more relevant for dialogue evaluation, as it does not penalize for word errors that are unimportant to overall utterance interpretation.

3. An annotation scheme for student emotion and attitude

3.1. Emotion classes

In our data, student emotions[8] can only be identified indirectly: via what is said and/or how it is said. However, such evidence is not always obvious, unambiguous, or consistent. For example, a student may express anger through the use of swear words, or through a particular tone of voice, or via a combination of signals, or not at all. Moreover, another student may present some of these same signals even when s/he does not feel anger. In (Litman and Forbes-Riley, 2004a), we present a coding scheme for manually annotating the student turns[9] in our spoken tutoring dialogues for intuitively perceived expression of emotion. In this scheme, expressions of emotion[10] are viewed along a linear scale, shown and defined as follows[11]:

negative - neutral - positive

[6] The Cepstral system is a commercial outgrowth of the Festival system (Black and Taylor, 1997).
[7] An internal evaluation of this semantic analysis component in an early version of the Why2-Atlas system (with its typed input, and thus perfect transcription) yielded 97% accuracy (Rose, 2005).
[8] In the rest of this paper, we will use the term emotion loosely, to cover both affects and attitudes that can impact student learning. Although some argue that emotion should be distinguished from attitude, some speech researchers have found that the narrow sense of emotion is too restrictive because it excludes states in speech where emotion is present but not full-blown, including arousal and attitude (Cowie and Cornelius, 2003). Some tutoring researchers have also found it useful to take a combined view of affect and attitude (Bhatt et al., 2004).
[9] We use the terms turn and utterance interchangeably in this paper.
[10] Although an expression of emotion is not interchangeable with the emotion itself (Russell et al., 2003), our use of the term emotion hereafter should be understood as referring (when appropriate) to the annotated expression of emotion.
[11] In (Litman and Forbes-Riley, 2004a), we have also explored separately annotating strong, weak and mixed emotions, as well as annotating specific emotions such as uncertain, irritated, confident; complete details of our annotation studies are described therein.
[12] These negative, neutral and positive emotion classes correspond to traditional notions of valence (cf. Cowie and Cornelius, 2003), but these terms are not related to the impact of emotion on learning. For example, in work that draws on a disequilibrium theory of the relationship between emotion and learning, working through negative emotions is believed to be a necessary part of the learning process (Craig and Graesser, 2003).

Negative[12]: a student turn that expresses emotions such as confused, bored, irritated, uncertain, sad. Examples of negative student turns in our human-human and human-computer corpora are found in Figs. 2 and 3. Evidence[13] of a negative emotion can come from lexical expressions of uncertainty, e.g., the phrase "I don't know", a syntactic question, or disfluencies, as well as from acoustic and prosodic features, including pausing, pitch and energy variation. For example, the negative student turn, student 5, in Fig. 2 contains the phrase "I don't know why", as well as frequent internal pausing and a wide pitch variation.[14] The negative student turn, student 19, in Fig. 3 displays a slow tempo and rising intonation.

Positive: a student turn that expresses emotions such as confident, enthusiastic. For example, a student turn in Fig. 2 is labeled positive; evidence of a positive emotion in this case comes from lexical expressions of certainty, e.g., "It's the...", as well as from acoustic and prosodic features, including loud speech and a fast tempo. The positive student turn, student 21, in Fig. 3 displays a fast tempo with very little pausing preceding the utterance.

Neutral: a student turn that does not express a positive or negative emotion. Examples of neutral student turns are student 8 in Fig. 2 and student 22 in Fig. 3. Acoustic and prosodic features, including moderate loudness, tempo, and inflection, give evidence for these neutral labels, as does the lack of semantic content in the grounding phrase "mm-hm".

Emotion annotations were performed from both audio and transcription using the sound visualization and manipulation tool Wavesurfer.[15] The emotion annotators were instructed to try to annotate emotion relative to both context and task. By context-relative we mean that a student turn in our tutoring dialogues is identified as expressing emotion relative to the other student turns in that dialogue. By task-relative we mean that a student turn perceived during tutoring as expressing an emotion might not be perceived as expressing the same emotion with the same strength in another (e.g., non-tutoring) situation. Moreover, the range of emotions that arise during tutoring might not be the same as the range of emotions that arise during some other task. For example, consider the context of a tutoring session, where a student has been answering tutor questions with apparent ease. If the tutor then asks another question, and the student responds slowly, saying "Um, now I'm confused", this turn would likely be labeled negative. However, in the context of a heated argument between two people, this same turn might be labeled as a weak negative, or even a weak positive. Litman and Forbes-Riley (2004a) provides full details of our annotation scheme, including discussion of our coding manual and annotation tool, while Section 7 compares our scheme to related work.

3.2. Quantifying inter-annotator agreement

We conducted a study for each corpus, to quantify the degree of agreement between two coders (the authors) in classifying utterances using our annotation scheme. To analyze agreement in our human-human spoken tutoring corpus (Section 2.2), we randomly selected 10 transcribed dialogues from 9 subjects, yielding a dataset of 453 student turns, where approximately 40 turns came from each of the 9 subjects. The 453 turns were separately annotated by the two authors, using the emotion annotation scheme described above. To analyze agreement in our human-computer corpus (Section 2.3), we randomly selected 15 transcribed dialogues from 10 subjects, yielding a dataset of 333 student turns, where approximately 30 turns came from each of the 10 subjects. Each turn was again separately annotated by the two authors.

Two confusion matrices summarizing the resulting agreement between the two emotion annotators for each corpus are shown in Tables 1 and 2. The rows correspond to the labels assigned by annotator 1, and the columns correspond to the labels assigned by annotator 2.

Table 1. Confusion matrix for human-human corpus annotation (rows: negative, neutral and positive labels assigned by annotator 1; columns: the same labels assigned by annotator 2).
Table 2. Confusion matrix for human-computer corpus annotation (rows: negative, neutral and positive labels assigned by annotator 1; columns: the same labels assigned by annotator 2).

[13] As determined by post-annotation discussion (see Section 7).
[14] As illustrated by the hyperlinks in Figs. 2 and 3, annotators could also listen to the recording of the dialogue, as detailed below.
[15] The tool is shown in Figs. 5 and 6.

For example, in Table 1, 112 negatives were agreed upon by both annotators, while the remaining negatives assigned by annotator 1 were labeled by annotator 2 as neutral or, in 9 cases, as positive. Note that across both corpora, annotator 2 consistently annotates more positive and fewer neutral turns than annotator 1.

As shown along the diagonal of Table 1, the two annotators agreed on the annotations of 340/453 student turns in the human-human tutoring data, achieving 75.1% agreement (Kappa = 0.6, α = 0.6).[16] As shown along the diagonal of Table 2, the two annotators agreed on the annotations of 202/333 student turns in the human-computer tutoring data, achieving 60.7% agreement (Kappa = 0.4, α = 0.4).[17] It has generally been found to be difficult to achieve levels of inter-annotator agreement above "Moderate" (see footnote 16) for emotion annotation in naturally occurring dialogues. Ang et al. (2002), for example, report inter-annotator agreement of 71% (Kappa 0.47), while Shafran et al. (2003) report Kappas ranging upward from 0.32. Such studies were nevertheless able to use acoustic-prosodic cues to effectively distinguish these annotator judgments of emotion.

A number of researchers have accommodated low inter-annotator agreement for emotion annotation by exploring ways of achieving consensus between disagreed annotations. Following Ang et al. (2002) and Devillers et al. (2003), we explored consensus labeling, both with the goal of increasing our usable dataset for prediction, and to include the more difficult annotation cases. For our consensus labeling, the original annotators revisited each originally disagreed case and, through discussion, sought a consensus label. Due to consensus labeling, agreement rose in both our human-human and human-computer data to 100%.[18] A summary of the distribution of emotion labels after consensus labeling is shown in Table 3.

Table 3. Consensus labeling over emotion-annotated data (distribution of negative, neutral and positive labels in the human-human and human-computer corpora).

As in (Ang et al., 2002), we will experiment with predicting emotions in Section 5 using both our agreed data and our consensus-labeled data.[19] Table 4 summarizes the characteristics of the emotion-annotated subsets of both our human and computer tutoring corpora, with respect to both the agreed and consensus emotion labels.

As a final note, during the annotation and subsequent consensus discussions, we observed that the human-human and human-computer dialogues differ with respect to a variety of characteristics.

[16] Kappa and α are metrics for computing the pairwise agreement among annotators making category judgments. Kappa (Carletta, 1996; Siegel et al., 1988; Cohen, 1960) is computed as (P(A) - P(E)) / (1 - P(E)), where P(A) is the proportion of actual agreement among annotators, and P(E) is the proportion of agreement expected by chance. α (Krippendorf, 1980) is computed as 1 - D(O)/D(E), where D(O) is the proportion of observed disagreement between annotators and D(E) is the proportion of disagreement expected by chance. When there is no agreement other than that expected by chance, Kappa and α = 0; when there is total agreement, Kappa and α = 1. Krippendorf's (1980) α and Siegel et al.'s (1988) version of Kappa are nearly identical; however, these two metrics use a different method of estimating the probability distribution for chance than does Cohen's (1960) version of Kappa (DiEugenio and Glass, 2004), which is used in this paper. Although interpreting the strength of inter-annotator agreement is controversial (DiEugenio and Glass, 2004), Landis and Koch (1977) and others use the following standard for Kappa: 0.21-0.40, "Fair"; 0.41-0.60, "Moderate"; 0.61-0.80, "Substantial"; 0.81-1.00, "Almost Perfect". Krippendorf (1980) uses the following stricter standard for α: α < .67, "cannot draw conclusions"; .67 ≤ α ≤ .8, "allows tentative conclusions"; α > .8, "allows definite conclusions". Although neither metric is ideal for this study because both assume independent events, unlike other measures of agreement such as percent agreement, Kappa and α take into account the inherent complexity of a task by correcting for chance expected agreement.
[17] Since our emotion categories are ordinal/interval rather than nominal, we can also quantify agreement using a weighted version of Kappa (Cohen, 1968), which accounts for the relative distances between successive categories. With (quadratic) weighting, our Kappa values increase to .7 and .5 for the human-human and human-computer annotations, respectively. Similarly, using an interval version of α (Krippendorf, 1980) that also accounts for a relative distance (of 1) between categories, the α values increase to .7 and .5 for the human-human and human-computer annotations, respectively.
[18] There were eight student turns in the human-human corpus for which the annotators had difficulty deciding upon a consensus label; these cases were given the neutral consensus label as a result.
[19] Although not discussed in this paper, we have also run prediction experiments using each individual annotator's labeled data; the results in each case were lower than those for the agreed data, and were approximately the same as the results for the consensus-labeled data, as discussed below.
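Agreement figures of this kind can be computed directly from an annotator confusion matrix. The following minimal sketch (Python with NumPy, not the paper's analysis code) implements percent agreement, the unweighted Kappa defined in footnote 16, and the quadratic-weighted Kappa mentioned in footnote 17; the matrix below is a hypothetical example rather than the counts of Table 1 or 2.

```python
import numpy as np

def agreement_stats(confusion, quadratic=False):
    """Percent agreement and (optionally quadratic-weighted) Kappa for a k x k
    confusion matrix: rows = annotator 1's labels, columns = annotator 2's."""
    c = np.asarray(confusion, dtype=float)
    n = c.sum()
    observed = c / n                                            # observed proportions
    expected = np.outer(c.sum(axis=1), c.sum(axis=0)) / n ** 2  # chance proportions

    if quadratic:                       # weighted Kappa (cf. footnote 17)
        k = c.shape[0]
        i, j = np.indices((k, k))
        w = (i - j) ** 2 / (k - 1) ** 2                  # quadratic disagreement weights
        kappa = 1 - (w * observed).sum() / (w * expected).sum()
    else:                               # Kappa = (P(A) - P(E)) / (1 - P(E)), footnote 16
        p_a = np.trace(observed)        # P(A): proportion of actual agreement
        p_e = np.trace(expected)        # P(E): proportion expected by chance
        kappa = (p_a - p_e) / (1 - p_e)
    return np.trace(observed), kappa

# Hypothetical 3x3 matrix over (negative, neutral, positive); NOT the counts of Table 1.
C = [[40, 12, 3],
     [10, 65, 8],
     [ 2, 11, 29]]
print(agreement_stats(C))                  # (percent agreement, unweighted Kappa)
print(agreement_stats(C, quadratic=True))  # (percent agreement, quadratic-weighted Kappa)
```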

Table 4. Summary of emotion-annotated data for the human-human and human-computer corpora, under both the Agreed and Consensus labelings: number of students, dialogues, student turns, student words and unique student words, minutes of student speech, and the majority class (neutral), which is 53% (Agreed) and 60% (Consensus) for the human-human data and 47% (Agreed) and 48% (Consensus) for the human-computer data.

Many of these differences are illustrated in the corpus excerpts above, and in part reflect the fact that our computer tutor is far less robust than our human tutor with respect to its interactiveness and understanding capabilities. Such differences can potentially impact both the emotional state of the student and how the student is able to express an emotional state. We hypothesize that such differences may also have impacted the comparative difficulty of annotating emotion in the two corpora. For example, the average student turn length in the 10 annotated human-human dialogues is longer than the average turn length of 2.52 words in the 15 human-computer dialogues. The fact that students speak less in the human-computer dialogues means that there is less information to make use of when judging expressed emotions. We also observed that in the human-human dialogues there are more student initiatives and groundings, as well as references to prior problems. The limitations of the computer tutor may thus have restricted how students expressed themselves (including how they expressed their own emotional states) in other ways besides word quantity. Finally, the fact that the computer tutor made processing errors may have impacted both the types and quantity of student emotional states. As shown in Table 3, there is a higher proportion of negative emotions in the human-computer corpus as compared to the human-human corpus (38% versus 26%, respectively). As we will see with our machine learning experiments in Section 5, emotion prediction is also more difficult in the human-computer corpus, which may again in part reflect its differing dialogue characteristics that arise from the limitations of the computer tutor.

4. Extracting features from the speech signal of student turns

4.1. Acoustic-prosodic features

For each of the emotion-annotated student turns in the human-human and human-computer corpora, we computed the 12 acoustic and prosodic features itemized in Fig. 4, for use in the machine learning experiments described in Section 5.

Fig. 4. Twelve acoustic-prosodic features per student turn.
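As a compact reference, the 12 features described in Section 4.1 can be grouped as in the following sketch; the grouping follows the running text and is not copied from Fig. 4.

```python
# The 12 per-turn acoustic-prosodic features described in Section 4.1, grouped by type.
# Names follow the running text; the grouping is a reading aid, not a copy of Fig. 4.
ACOUSTIC_PROSODIC_FEATURES = {
    "pitch (f0)":   ["maxf0", "minf0", "meanf0", "stdf0"],
    "energy (RMS)": ["maxrms", "minrms", "meanrms", "stdrms"],
    "temporal":     ["duration", "prepause", "tempo", "intsilence"],
}
```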

Motivated by previous studies of emotion prediction in spontaneous dialogues in other domains (Ang et al., 2002; Lee et al., 2001; Batliner et al., 2003), our acoustic-prosodic features represent knowledge of pitch, energy, duration, tempo and pausing. Aspects of silence and pausing have been shown to be relevant for categorizing other aspects of student behavior in tutoring dialogues as well (Fox, 1993; Shah et al., 2002). We focus on acoustic and prosodic features of individual turns that can be computed automatically from the speech signal and are available in real time to ITSPOKE, since our long-term goal is to use these features to trigger online adaptation in ITSPOKE based on predicted student emotions.[20]

F0 and RMS values, representing measures of pitch excursion and loudness, respectively, were computed using Entropic Research Laboratory's pitch tracker, get_f0,[21] with no post-correction. A pitch tracker takes as input a speech file, and outputs a fundamental frequency (f0) contour (the physical correlate of pitch). In Fig. 5, for example, the Pitch Pane displays the f0 contour for the experimentally obtained speech file shown in the Student Speech pane, where the x-axis represents time and the y-axis represents frequency in Hz.[22] Each f0 value corresponds to a frame step of 0.01 s across the student turn "free fall?"; the rising f0 contour is typical of a question. Our features maxf0 and minf0 correspond to the highest and lowest f0 values (the peaks and valleys) in the f0 contour, while meanf0 and stdf0 are based on averaging over all the (non-zero) f0 values in the contour, which are given by get_f0 in frame steps of 0.01 s.

Energy can alternatively be represented in terms of decibels (dB) or root mean squared amplitude (rms). For example, the Energy Pane at the bottom of Fig. 5 displays energy values computed in decibels across frame steps of 0.01 s for the student speech shown in the Student Speech pane, where the x-axis represents time and the y-axis represents decibels. The variation in energy values across this student turn reflects the fact that the student's utterance itself is much louder than the silences before and after (although, as can be seen, the analysis picks up some minor background noise when the student is not speaking).

Fig. 5. Computing pitch and energy-based features.

[20] In preliminary experiments for this paper, and also in previous work (e.g., Litman and Forbes, 2003), we also investigated the use of two normalized versions of our acoustic-prosodic features, specifically, features normalized by either prior turn or by first turn. These normalizations have the benefit of removing the gender dependency of f0 features. However, we have consistently found little difference in predictive utility for raw versus normalized features, in both our human-human and human-computer data, so we use only raw (non-normalized) feature values here. As discussed below, however, we experiment with the use of gender as an explicit feature.
[21] get_f0 and other Entropic software is currently available free of charge.
[22] The representations in Figs. 5 and 6 use the Wavesurfer sound visualization and manipulation tool.
The get_f0 pitch tracker used in this study computes energy as rms values based on a 0.03-s window within frame steps of 0.01 s. maxrms and minrms correspond to the highest and lowest rms values over all the frames in a student turn, while meanrms and stdrms are based on averaging over all the rms values in the frames in a student turn.

Our four temporal features were computed from the turn boundaries of the transcribed speech. Recall that during our corpora collection, student and tutor speech are digitally recorded separately, yielding a 2-channel speech file for each dialogue. In Fig. 6, the Tutor Speech and Student Speech panes show a portion of the tutor and student speech files, while the Tutor Text and Student Text panes show the associated transcriptions. The vertical lines around each tutor and student utterance correspond to the turn segmentations. For example, the leftmost vertical line indicates that the tutor's turn "what is that motion?" begins approximately 30.45 s (30,450 ms) into the dialogue. Recall that in our human-human dialogues these turn boundaries are manually labeled by our paid transcriber. In our human-computer dialogues, tutor turn boundaries correspond to the beginning and end times of the speech synthesis process, while student turn boundaries correspond to the beginning and end times of the student speech as detected by the speech recognizer, and thus are a noisy estimate of the actual student turn boundaries.[23]

The duration of each student turn was calculated by subtracting the turn's beginning time from its ending time. In Fig. 6, for example, the duration of the student's turn is approximately 0.90 s.

[23] While we manually transcribed the lexical information to quantify the error due to speech recognition, we did not manually relabel turn boundaries; we thus cannot quantify the level of noise introduced by automatic turn segmentation.
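The statistics above were produced with Entropic's get_f0, which is not assumed below. As a rough illustration only, the following sketch computes analogous per-turn pitch, energy and duration features with the open-source librosa library and a 0.01-s frame step; the file name and turn times are hypothetical, and the values will not exactly match get_f0's.

```python
import numpy as np
import librosa

def pitch_energy_duration(wav_path, turn_start, turn_end):
    """Per-turn f0 and RMS statistics plus duration, roughly analogous to the
    get_f0-based features of Section 4.1 (not a re-implementation of get_f0).
    Assumes the turn contains at least some voiced speech."""
    y, sr = librosa.load(wav_path, sr=None)
    turn = y[int(turn_start * sr):int(turn_end * sr)]        # slice out the student turn
    hop = int(0.01 * sr)                                     # 0.01-s frame step

    # f0 contour; unvoiced frames are returned as NaN and excluded from the statistics
    f0, voiced_flag, _ = librosa.pyin(turn, fmin=75, fmax=500, sr=sr, hop_length=hop)
    voiced_f0 = f0[~np.isnan(f0)]

    # RMS energy per frame, using a 0.03-s window as in the description of get_f0 above
    rms = librosa.feature.rms(y=turn, frame_length=int(0.03 * sr), hop_length=hop)[0]

    return {
        "maxf0": float(np.max(voiced_f0)), "minf0": float(np.min(voiced_f0)),
        "meanf0": float(np.mean(voiced_f0)), "stdf0": float(np.std(voiced_f0)),
        "maxrms": float(np.max(rms)), "minrms": float(np.min(rms)),
        "meanrms": float(np.mean(rms)), "stdrms": float(np.std(rms)),
        "duration": turn_end - turn_start,        # ending time minus beginning time
    }

# Hypothetical usage: a student turn from 31.35 s to 32.25 s on the student channel.
# features = pitch_energy_duration("student_channel.wav", 31.35, 32.25)
```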

The preceding pause (prepause) before a student turn began was calculated by subtracting the ending time of the tutor's (prior) turn from the beginning time of the student's turn. In Fig. 6, for example, the duration of the pause preceding the student's turn is 0.85 s.[24]

The speaking rate (tempo) was calculated as syllables per second in the turn, where the number of syllables in the transcription was computed using the Festival text-to-speech OALD dictionary, and the turn duration was computed as above.[25] For example, in Fig. 6 there are five syllables in the student turn ("the freefall motion"), and the duration of the student turn is 0.90 s (as computed above); thus the speaking rate in the turn is 5.56 syllables/second. In this paper, we computed tempo in the human-computer dialogues based on the human transcription of the student turns. Although this more closely reflects the actual tempo than the noisier tempo computed on the automatic speech recognition output, Ang et al. (2002) compared machine learning experiments using features such as tempo computed both on the human transcription and on the automatically recognized speech, and found that the prediction results were comparable.

Amount of silence (intsilence) was defined as the percentage of frames in the turn where the probability of voicing = 0; this probability is available from the output of the get_f0 pitch tracker, and the resulting percentage represents roughly the percentage of time within the turn that the student was silent.[26] For example, the student turn in Fig. 6 has approximately 31% internal silence.

Fig. 6. Computing temporal features.

[24] Note that in the human-human corpus, if a student turn began before the prior tutor turn ended (i.e., student barge-ins and overlaps), the preceding pause feature for that turn was 0. If a student turn initiated a dialogue or was preceded by a student turn (rather than a tutor turn), its preceding pause feature was not defined for that turn. In the human-computer corpus, every student turn is preceded by a tutor turn.
[25] Note that this method calculates only a single (average) tempo for the turn, because we were not sampling the tempo at subintervals throughout the turn.
[26] Using the percentage of unvoiced frames as a measure of silence will overestimate the amount of silence in the turn, because, e.g., long unvoiced fricatives will be included; however, it has been used in previous work as a rough estimate of internal silence (Litman et al., 2001). In our data, energy was rarely zero across the individual frames per turn, and thus was not a better estimate of internal silence.
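As a companion to the pitch and energy sketch above, the following hedged sketch computes the four temporal features from turn boundaries, a transcription and a per-frame voicing mask. The count_syllables heuristic merely stands in for the Festival OALD dictionary lookup used here, and the numbers in the usage line are hypothetical values chosen to echo the Fig. 6 walk-through.

```python
import numpy as np

def count_syllables(word):
    """Crude vowel-group heuristic standing in for the Festival OALD dictionary lookup."""
    groups, prev_vowel = 0, False
    for ch in word.lower():
        is_vowel = ch in "aeiouy"
        if is_vowel and not prev_vowel:
            groups += 1
        prev_vowel = is_vowel
    return max(groups, 1)

def temporal_features(turn_start, turn_end, prior_tutor_end, transcript, voiced_flag):
    """duration, prepause, tempo and intsilence for one student turn; voiced_flag is a
    per-frame boolean voicing mask, e.g. from a pitch tracker."""
    duration = turn_end - turn_start
    prepause = max(turn_start - prior_tutor_end, 0.0)        # 0 for barge-ins/overlaps
    syllables = sum(count_syllables(w) for w in transcript.split())
    tempo = syllables / duration                             # syllables per second
    intsilence = 100.0 * np.mean(~np.asarray(voiced_flag))   # % of unvoiced frames
    return {"duration": duration, "prepause": prepause,
            "tempo": tempo, "intsilence": intsilence}

# Hypothetical numbers echoing the Fig. 6 walk-through: a 0.90-s turn with a 0.85-s
# preceding pause and the transcript "the freefall motion" (five syllables).
print(temporal_features(31.35, 32.25, 30.50, "the freefall motion",
                        voiced_flag=[True] * 62 + [False] * 28))
```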

4.2. Adding identifier features representing the student and problem

Finally, we also recorded for each turn the 3 identifier features shown in Fig. 7, all of which are automatically available in ITSPOKE through student login. Prior studies (Oudeyer, 2002; Lee et al., 2002) have shown that subject and gender features can play an important role in emotion recognition, because different genders and different speakers can convey emotions differently. Subject ID and problem ID are uniquely important in our tutoring domain because, in contrast to, e.g., call centers, where most callers are distinct, students will use our system repeatedly, and problems are repeated across students.[27]

Fig. 7. Three identifier features per student turn.

5. Predicting student emotion from acoustic-prosodic features

We next performed machine learning experiments with acoustic-prosodic features and our emotion-annotated student turns, to explore how well the 12 acoustic-prosodic features discussed in Section 4 predict the emotion labels in both our human-human and human-computer tutoring corpora. We explore the predictions for our originally agreed emotion labels and for our consensus emotion labels (Section 3.2). Using originally agreed data is expected to produce better results, since presumably annotators originally agreed on cases that provide more clear-cut prosodic information about emotional features (Ang et al., 2002), but using consensus data is worthwhile because it includes the less clear-cut data that the computer will actually encounter.

For these experiments, we use a boosting algorithm in the Weka machine learning software (Witten and Frank, 1999). In general, the boosting algorithm, called AdaBoostM1 in Weka, enables the accuracy of a weak learning algorithm to be improved by repeatedly applying that algorithm to different distributions or weightings of the training examples, each time generating a new weak prediction rule, and eventually combining all the weak prediction rules.

[27] In preliminary experiments for this paper, we examined the use of only subject ID, as well as of only subject ID and gender, with the view that these identifier features generalize to other domains besides physics. Overall we found that these two subsets produced results that were the same as including the problem ID feature.
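The experiments described here use Weka's AdaBoostM1. Purely as an illustration of the boosting-over-weak-learners idea, and of the majority-class baseline used throughout, the following sketch uses scikit-learn's AdaBoostClassifier (whose default weak learner is a depth-1 decision tree) as a rough analogue; the feature matrix, labels and parameter values are hypothetical stand-ins, not the actual data or experimental setup.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in data: one row of 12 acoustic-prosodic features per student turn
# and one emotion label per turn (453 turns, as in the annotated human-human subset).
rng = np.random.default_rng(0)
X = rng.normal(size=(453, 12))
y = rng.choice(["negative", "neutral", "positive"], size=453)

# Boosted weak learners (default: depth-1 decision trees), a rough analogue of
# Weka's AdaBoostM1; the number of estimators here is illustrative.
booster = AdaBoostClassifier(n_estimators=50, random_state=0)
baseline = DummyClassifier(strategy="most_frequent")   # majority-class baseline

print("boosting accuracy:", cross_val_score(booster, X, y, cv=10).mean())
print("baseline accuracy:", cross_val_score(baseline, X, y, cv=10).mean())
```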

About Giovanni De Poli. What is Model. Introduction. di Poli: Methodologies for Expressive Modeling of/for Music Performance Methodologies for Expressiveness Modeling of and for Music Performance by Giovanni De Poli Center of Computational Sonology, Department of Information Engineering, University of Padova, Padova, Italy About

More information

On time: the influence of tempo, structure and style on the timing of grace notes in skilled musical performance

On time: the influence of tempo, structure and style on the timing of grace notes in skilled musical performance RHYTHM IN MUSIC PERFORMANCE AND PERCEIVED STRUCTURE 1 On time: the influence of tempo, structure and style on the timing of grace notes in skilled musical performance W. Luke Windsor, Rinus Aarts, Peter

More information

2. AN INTROSPECTION OF THE MORPHING PROCESS

2. AN INTROSPECTION OF THE MORPHING PROCESS 1. INTRODUCTION Voice morphing means the transition of one speech signal into another. Like image morphing, speech morphing aims to preserve the shared characteristics of the starting and final signals,

More information

Measurement of overtone frequencies of a toy piano and perception of its pitch

Measurement of overtone frequencies of a toy piano and perception of its pitch Measurement of overtone frequencies of a toy piano and perception of its pitch PACS: 43.75.Mn ABSTRACT Akira Nishimura Department of Media and Cultural Studies, Tokyo University of Information Sciences,

More information

in the Howard County Public School System and Rocketship Education

in the Howard County Public School System and Rocketship Education Technical Appendix May 2016 DREAMBOX LEARNING ACHIEVEMENT GROWTH in the Howard County Public School System and Rocketship Education Abstract In this technical appendix, we present analyses of the relationship

More information

A repetition-based framework for lyric alignment in popular songs

A repetition-based framework for lyric alignment in popular songs A repetition-based framework for lyric alignment in popular songs ABSTRACT LUONG Minh Thang and KAN Min Yen Department of Computer Science, School of Computing, National University of Singapore We examine

More information

Music Emotion Recognition. Jaesung Lee. Chung-Ang University

Music Emotion Recognition. Jaesung Lee. Chung-Ang University Music Emotion Recognition Jaesung Lee Chung-Ang University Introduction Searching Music in Music Information Retrieval Some information about target music is available Query by Text: Title, Artist, or

More information

Hidden Markov Model based dance recognition

Hidden Markov Model based dance recognition Hidden Markov Model based dance recognition Dragutin Hrenek, Nenad Mikša, Robert Perica, Pavle Prentašić and Boris Trubić University of Zagreb, Faculty of Electrical Engineering and Computing Unska 3,

More information

Retrieval of textual song lyrics from sung inputs

Retrieval of textual song lyrics from sung inputs INTERSPEECH 2016 September 8 12, 2016, San Francisco, USA Retrieval of textual song lyrics from sung inputs Anna M. Kruspe Fraunhofer IDMT, Ilmenau, Germany kpe@idmt.fraunhofer.de Abstract Retrieving the

More information

Automatic Laughter Detection

Automatic Laughter Detection Automatic Laughter Detection Mary Knox Final Project (EECS 94) knoxm@eecs.berkeley.edu December 1, 006 1 Introduction Laughter is a powerful cue in communication. It communicates to listeners the emotional

More information

Sarcasm Detection in Text: Design Document

Sarcasm Detection in Text: Design Document CSC 59866 Senior Design Project Specification Professor Jie Wei Wednesday, November 23, 2016 Sarcasm Detection in Text: Design Document Jesse Feinman, James Kasakyan, Jeff Stolzenberg 1 Table of contents

More information

Acoustic and musical foundations of the speech/song illusion

Acoustic and musical foundations of the speech/song illusion Acoustic and musical foundations of the speech/song illusion Adam Tierney, *1 Aniruddh Patel #2, Mara Breen^3 * Department of Psychological Sciences, Birkbeck, University of London, United Kingdom # Department

More information

Feature-Based Analysis of Haydn String Quartets

Feature-Based Analysis of Haydn String Quartets Feature-Based Analysis of Haydn String Quartets Lawson Wong 5/5/2 Introduction When listening to multi-movement works, amateur listeners have almost certainly asked the following situation : Am I still

More information

Topics in Computer Music Instrument Identification. Ioanna Karydi

Topics in Computer Music Instrument Identification. Ioanna Karydi Topics in Computer Music Instrument Identification Ioanna Karydi Presentation overview What is instrument identification? Sound attributes & Timbre Human performance The ideal algorithm Selected approaches

More information

Computational Modelling of Harmony

Computational Modelling of Harmony Computational Modelling of Harmony Simon Dixon Centre for Digital Music, Queen Mary University of London, Mile End Rd, London E1 4NS, UK simon.dixon@elec.qmul.ac.uk http://www.elec.qmul.ac.uk/people/simond

More information

IMPROVING SIGNAL DETECTION IN SOFTWARE-BASED FACIAL EXPRESSION ANALYSIS

IMPROVING SIGNAL DETECTION IN SOFTWARE-BASED FACIAL EXPRESSION ANALYSIS WORKING PAPER SERIES IMPROVING SIGNAL DETECTION IN SOFTWARE-BASED FACIAL EXPRESSION ANALYSIS Matthias Unfried, Markus Iwanczok WORKING PAPER /// NO. 1 / 216 Copyright 216 by Matthias Unfried, Markus Iwanczok

More information

Reducing False Positives in Video Shot Detection

Reducing False Positives in Video Shot Detection Reducing False Positives in Video Shot Detection Nithya Manickam Computer Science & Engineering Department Indian Institute of Technology, Bombay Powai, India - 400076 mnitya@cse.iitb.ac.in Sharat Chandran

More information

OBJECTIVE EVALUATION OF A MELODY EXTRACTOR FOR NORTH INDIAN CLASSICAL VOCAL PERFORMANCES

OBJECTIVE EVALUATION OF A MELODY EXTRACTOR FOR NORTH INDIAN CLASSICAL VOCAL PERFORMANCES OBJECTIVE EVALUATION OF A MELODY EXTRACTOR FOR NORTH INDIAN CLASSICAL VOCAL PERFORMANCES Vishweshwara Rao and Preeti Rao Digital Audio Processing Lab, Electrical Engineering Department, IIT-Bombay, Powai,

More information

Automatic Speech Recognition (CS753)

Automatic Speech Recognition (CS753) Automatic Speech Recognition (CS753) Lecture 22: Conversational Agents Instructor: Preethi Jyothi Oct 26, 2017 (All images were reproduced from JM, chapters 29,30) Chatbots Rule-based chatbots Historical

More information

Music Performance Panel: NICI / MMM Position Statement

Music Performance Panel: NICI / MMM Position Statement Music Performance Panel: NICI / MMM Position Statement Peter Desain, Henkjan Honing and Renee Timmers Music, Mind, Machine Group NICI, University of Nijmegen mmm@nici.kun.nl, www.nici.kun.nl/mmm In this

More information

Detecting Musical Key with Supervised Learning

Detecting Musical Key with Supervised Learning Detecting Musical Key with Supervised Learning Robert Mahieu Department of Electrical Engineering Stanford University rmahieu@stanford.edu Abstract This paper proposes and tests performance of two different

More information

Phone-based Plosive Detection

Phone-based Plosive Detection Phone-based Plosive Detection 1 Andreas Madsack, Grzegorz Dogil, Stefan Uhlich, Yugu Zeng and Bin Yang Abstract We compare two segmentation approaches to plosive detection: One aproach is using a uniform

More information

Speaking in Minor and Major Keys

Speaking in Minor and Major Keys Chapter 5 Speaking in Minor and Major Keys 5.1. Introduction 28 The prosodic phenomena discussed in the foregoing chapters were all instances of linguistic prosody. Prosody, however, also involves extra-linguistic

More information

SHORT TERM PITCH MEMORY IN WESTERN vs. OTHER EQUAL TEMPERAMENT TUNING SYSTEMS

SHORT TERM PITCH MEMORY IN WESTERN vs. OTHER EQUAL TEMPERAMENT TUNING SYSTEMS SHORT TERM PITCH MEMORY IN WESTERN vs. OTHER EQUAL TEMPERAMENT TUNING SYSTEMS Areti Andreopoulou Music and Audio Research Laboratory New York University, New York, USA aa1510@nyu.edu Morwaread Farbood

More information

First Stage of an Automated Content-Based Citation Analysis Study: Detection of Citation Sentences 1

First Stage of an Automated Content-Based Citation Analysis Study: Detection of Citation Sentences 1 First Stage of an Automated Content-Based Citation Analysis Study: Detection of Citation Sentences 1 Zehra Taşkın *, Umut Al * and Umut Sezen ** * {ztaskin; umutal}@hacettepe.edu.tr Department of Information

More information

WHAT'S HOT: LINEAR POPULARITY PREDICTION FROM TV AND SOCIAL USAGE DATA Jan Neumann, Xiaodong Yu, and Mohamad Ali Torkamani Comcast Labs

WHAT'S HOT: LINEAR POPULARITY PREDICTION FROM TV AND SOCIAL USAGE DATA Jan Neumann, Xiaodong Yu, and Mohamad Ali Torkamani Comcast Labs WHAT'S HOT: LINEAR POPULARITY PREDICTION FROM TV AND SOCIAL USAGE DATA Jan Neumann, Xiaodong Yu, and Mohamad Ali Torkamani Comcast Labs Abstract Large numbers of TV channels are available to TV consumers

More information

Composer Style Attribution

Composer Style Attribution Composer Style Attribution Jacqueline Speiser, Vishesh Gupta Introduction Josquin des Prez (1450 1521) is one of the most famous composers of the Renaissance. Despite his fame, there exists a significant

More information

Improving Piano Sight-Reading Skills of College Student. Chian yi Ang. Penn State University

Improving Piano Sight-Reading Skills of College Student. Chian yi Ang. Penn State University Improving Piano Sight-Reading Skill of College Student 1 Improving Piano Sight-Reading Skills of College Student Chian yi Ang Penn State University 1 I grant The Pennsylvania State University the nonexclusive

More information

Automatic Laughter Detection

Automatic Laughter Detection Automatic Laughter Detection Mary Knox 1803707 knoxm@eecs.berkeley.edu December 1, 006 Abstract We built a system to automatically detect laughter from acoustic features of audio. To implement the system,

More information

Skip Length and Inter-Starvation Distance as a Combined Metric to Assess the Quality of Transmitted Video

Skip Length and Inter-Starvation Distance as a Combined Metric to Assess the Quality of Transmitted Video Skip Length and Inter-Starvation Distance as a Combined Metric to Assess the Quality of Transmitted Video Mohamed Hassan, Taha Landolsi, Husameldin Mukhtar, and Tamer Shanableh College of Engineering American

More information

AUTOMATIC MAPPING OF SCANNED SHEET MUSIC TO AUDIO RECORDINGS

AUTOMATIC MAPPING OF SCANNED SHEET MUSIC TO AUDIO RECORDINGS AUTOMATIC MAPPING OF SCANNED SHEET MUSIC TO AUDIO RECORDINGS Christian Fremerey, Meinard Müller,Frank Kurth, Michael Clausen Computer Science III University of Bonn Bonn, Germany Max-Planck-Institut (MPI)

More information

Musical Signal Processing with LabVIEW Introduction to Audio and Musical Signals. By: Ed Doering

Musical Signal Processing with LabVIEW Introduction to Audio and Musical Signals. By: Ed Doering Musical Signal Processing with LabVIEW Introduction to Audio and Musical Signals By: Ed Doering Musical Signal Processing with LabVIEW Introduction to Audio and Musical Signals By: Ed Doering Online:

More information

Temporal coordination in string quartet performance

Temporal coordination in string quartet performance International Symposium on Performance Science ISBN 978-2-9601378-0-4 The Author 2013, Published by the AEC All rights reserved Temporal coordination in string quartet performance Renee Timmers 1, Satoshi

More information

Analysis of the effects of signal distance on spectrograms

Analysis of the effects of signal distance on spectrograms 2014 Analysis of the effects of signal distance on spectrograms SGHA 8/19/2014 Contents Introduction... 3 Scope... 3 Data Comparisons... 5 Results... 10 Recommendations... 10 References... 11 Introduction

More information

Figures in Scientific Open Access Publications

Figures in Scientific Open Access Publications Figures in Scientific Open Access Publications Lucia Sohmen 2[0000 0002 2593 8754], Jean Charbonnier 1[0000 0001 6489 7687], Ina Blümel 1,2[0000 0002 3075 7640], Christian Wartena 1[0000 0001 5483 1529],

More information

Drum Sound Identification for Polyphonic Music Using Template Adaptation and Matching Methods

Drum Sound Identification for Polyphonic Music Using Template Adaptation and Matching Methods Drum Sound Identification for Polyphonic Music Using Template Adaptation and Matching Methods Kazuyoshi Yoshii, Masataka Goto and Hiroshi G. Okuno Department of Intelligence Science and Technology National

More information

Pitfalls and Windfalls in Corpus Studies of Pop/Rock Music

Pitfalls and Windfalls in Corpus Studies of Pop/Rock Music Introduction Hello, my talk today is about corpus studies of pop/rock music specifically, the benefits or windfalls of this type of work as well as some of the problems. I call these problems pitfalls

More information

EFFECT OF REPETITION OF STANDARD AND COMPARISON TONES ON RECOGNITION MEMORY FOR PITCH '

EFFECT OF REPETITION OF STANDARD AND COMPARISON TONES ON RECOGNITION MEMORY FOR PITCH ' Journal oj Experimental Psychology 1972, Vol. 93, No. 1, 156-162 EFFECT OF REPETITION OF STANDARD AND COMPARISON TONES ON RECOGNITION MEMORY FOR PITCH ' DIANA DEUTSCH " Center for Human Information Processing,

More information

ONLINE ACTIVITIES FOR MUSIC INFORMATION AND ACOUSTICS EDUCATION AND PSYCHOACOUSTIC DATA COLLECTION

ONLINE ACTIVITIES FOR MUSIC INFORMATION AND ACOUSTICS EDUCATION AND PSYCHOACOUSTIC DATA COLLECTION ONLINE ACTIVITIES FOR MUSIC INFORMATION AND ACOUSTICS EDUCATION AND PSYCHOACOUSTIC DATA COLLECTION Travis M. Doll Ray V. Migneco Youngmoo E. Kim Drexel University, Electrical & Computer Engineering {tmd47,rm443,ykim}@drexel.edu

More information

CSC475 Music Information Retrieval

CSC475 Music Information Retrieval CSC475 Music Information Retrieval Monophonic pitch extraction George Tzanetakis University of Victoria 2014 G. Tzanetakis 1 / 32 Table of Contents I 1 Motivation and Terminology 2 Psychacoustics 3 F0

More information

Available online at ScienceDirect. Procedia Computer Science 46 (2015 )

Available online at  ScienceDirect. Procedia Computer Science 46 (2015 ) Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 46 (2015 ) 381 387 International Conference on Information and Communication Technologies (ICICT 2014) Music Information

More information

BIBLIOMETRIC REPORT. Bibliometric analysis of Mälardalen University. Final Report - updated. April 28 th, 2014

BIBLIOMETRIC REPORT. Bibliometric analysis of Mälardalen University. Final Report - updated. April 28 th, 2014 BIBLIOMETRIC REPORT Bibliometric analysis of Mälardalen University Final Report - updated April 28 th, 2014 Bibliometric analysis of Mälardalen University Report for Mälardalen University Per Nyström PhD,

More information

Subjective evaluation of common singing skills using the rank ordering method

Subjective evaluation of common singing skills using the rank ordering method lma Mater Studiorum University of ologna, ugust 22-26 2006 Subjective evaluation of common singing skills using the rank ordering method Tomoyasu Nakano Graduate School of Library, Information and Media

More information

Melody classification using patterns

Melody classification using patterns Melody classification using patterns Darrell Conklin Department of Computing City University London United Kingdom conklin@city.ac.uk Abstract. A new method for symbolic music classification is proposed,

More information

SOUND LABORATORY LING123: SOUND AND COMMUNICATION

SOUND LABORATORY LING123: SOUND AND COMMUNICATION SOUND LABORATORY LING123: SOUND AND COMMUNICATION In this assignment you will be using the Praat program to analyze two recordings: (1) the advertisement call of the North American bullfrog; and (2) the

More information

Where to present your results. V4 Seminars for Young Scientists on Publishing Techniques in the Field of Engineering Science

Where to present your results. V4 Seminars for Young Scientists on Publishing Techniques in the Field of Engineering Science Visegrad Grant No. 21730020 http://vinmes.eu/ V4 Seminars for Young Scientists on Publishing Techniques in the Field of Engineering Science Where to present your results Dr. Balázs Illés Budapest University

More information

Modeling memory for melodies

Modeling memory for melodies Modeling memory for melodies Daniel Müllensiefen 1 and Christian Hennig 2 1 Musikwissenschaftliches Institut, Universität Hamburg, 20354 Hamburg, Germany 2 Department of Statistical Science, University

More information

VISUAL CONTENT BASED SEGMENTATION OF TALK & GAME SHOWS. O. Javed, S. Khan, Z. Rasheed, M.Shah. {ojaved, khan, zrasheed,

VISUAL CONTENT BASED SEGMENTATION OF TALK & GAME SHOWS. O. Javed, S. Khan, Z. Rasheed, M.Shah. {ojaved, khan, zrasheed, VISUAL CONTENT BASED SEGMENTATION OF TALK & GAME SHOWS O. Javed, S. Khan, Z. Rasheed, M.Shah {ojaved, khan, zrasheed, shah}@cs.ucf.edu Computer Vision Lab School of Electrical Engineering and Computer

More information

Modeling sound quality from psychoacoustic measures

Modeling sound quality from psychoacoustic measures Modeling sound quality from psychoacoustic measures Lena SCHELL-MAJOOR 1 ; Jan RENNIES 2 ; Stephan D. EWERT 3 ; Birger KOLLMEIER 4 1,2,4 Fraunhofer IDMT, Hör-, Sprach- und Audiotechnologie & Cluster of

More information

Acoustic Echo Canceling: Echo Equality Index

Acoustic Echo Canceling: Echo Equality Index Acoustic Echo Canceling: Echo Equality Index Mengran Du, University of Maryalnd Dr. Bogdan Kosanovic, Texas Instruments Industry Sponsored Projects In Research and Engineering (INSPIRE) Maryland Engineering

More information

1 Introduction to PSQM

1 Introduction to PSQM A Technical White Paper on Sage s PSQM Test Renshou Dai August 7, 2000 1 Introduction to PSQM 1.1 What is PSQM test? PSQM stands for Perceptual Speech Quality Measure. It is an ITU-T P.861 [1] recommended

More information

A SEMANTIC DIFFERENTIAL STUDY OF LOW AMPLITUDE SUPERSONIC AIRCRAFT NOISE AND OTHER TRANSIENT SOUNDS

A SEMANTIC DIFFERENTIAL STUDY OF LOW AMPLITUDE SUPERSONIC AIRCRAFT NOISE AND OTHER TRANSIENT SOUNDS 19 th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, 2-7 SEPTEMBER 2007 A SEMANTIC DIFFERENTIAL STUDY OF LOW AMPLITUDE SUPERSONIC AIRCRAFT NOISE AND OTHER TRANSIENT SOUNDS PACS: 43.28.Mw Marshall, Andrew

More information

Composer Identification of Digital Audio Modeling Content Specific Features Through Markov Models

Composer Identification of Digital Audio Modeling Content Specific Features Through Markov Models Composer Identification of Digital Audio Modeling Content Specific Features Through Markov Models Aric Bartle (abartle@stanford.edu) December 14, 2012 1 Background The field of composer recognition has

More information

Music Segmentation Using Markov Chain Methods

Music Segmentation Using Markov Chain Methods Music Segmentation Using Markov Chain Methods Paul Finkelstein March 8, 2011 Abstract This paper will present just how far the use of Markov Chains has spread in the 21 st century. We will explain some

More information

THE DIGITAL DELAY ADVANTAGE A guide to using Digital Delays. Synchronize loudspeakers Eliminate comb filter distortion Align acoustic image.

THE DIGITAL DELAY ADVANTAGE A guide to using Digital Delays. Synchronize loudspeakers Eliminate comb filter distortion Align acoustic image. THE DIGITAL DELAY ADVANTAGE A guide to using Digital Delays Synchronize loudspeakers Eliminate comb filter distortion Align acoustic image Contents THE DIGITAL DELAY ADVANTAGE...1 - Why Digital Delays?...

More information

Efficient Computer-Aided Pitch Track and Note Estimation for Scientific Applications. Matthias Mauch Chris Cannam György Fazekas

Efficient Computer-Aided Pitch Track and Note Estimation for Scientific Applications. Matthias Mauch Chris Cannam György Fazekas Efficient Computer-Aided Pitch Track and Note Estimation for Scientific Applications Matthias Mauch Chris Cannam György Fazekas! 1 Matthias Mauch, Chris Cannam, George Fazekas Problem Intonation in Unaccompanied

More information

White Paper Measuring and Optimizing Sound Systems: An introduction to JBL Smaart

White Paper Measuring and Optimizing Sound Systems: An introduction to JBL Smaart White Paper Measuring and Optimizing Sound Systems: An introduction to JBL Smaart by Sam Berkow & Alexander Yuill-Thornton II JBL Smaart is a general purpose acoustic measurement and sound system optimization

More information

Pre-Processing of ERP Data. Peter J. Molfese, Ph.D. Yale University

Pre-Processing of ERP Data. Peter J. Molfese, Ph.D. Yale University Pre-Processing of ERP Data Peter J. Molfese, Ph.D. Yale University Before Statistical Analyses, Pre-Process the ERP data Planning Analyses Waveform Tools Types of Tools Filter Segmentation Visual Review

More information

THE EFFECT OF EXPERTISE IN EVALUATING EMOTIONS IN MUSIC

THE EFFECT OF EXPERTISE IN EVALUATING EMOTIONS IN MUSIC THE EFFECT OF EXPERTISE IN EVALUATING EMOTIONS IN MUSIC Fabio Morreale, Raul Masu, Antonella De Angeli, Patrizio Fava Department of Information Engineering and Computer Science, University Of Trento, Italy

More information

Tutorial 0: Uncertainty in Power and Sample Size Estimation. Acknowledgements:

Tutorial 0: Uncertainty in Power and Sample Size Estimation. Acknowledgements: Tutorial 0: Uncertainty in Power and Sample Size Estimation Anna E. Barón, Keith E. Muller, Sarah M. Kreidler, and Deborah H. Glueck Acknowledgements: The project was supported in large part by the National

More information

... A Pseudo-Statistical Approach to Commercial Boundary Detection. Prasanna V Rangarajan Dept of Electrical Engineering Columbia University

... A Pseudo-Statistical Approach to Commercial Boundary Detection. Prasanna V Rangarajan Dept of Electrical Engineering Columbia University A Pseudo-Statistical Approach to Commercial Boundary Detection........ Prasanna V Rangarajan Dept of Electrical Engineering Columbia University pvr2001@columbia.edu 1. Introduction Searching and browsing

More information

Recommending Music for Language Learning: The Problem of Singing Voice Intelligibility

Recommending Music for Language Learning: The Problem of Singing Voice Intelligibility Recommending Music for Language Learning: The Problem of Singing Voice Intelligibility Karim M. Ibrahim (M.Sc.,Nile University, Cairo, 2016) A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF SCIENCE DEPARTMENT

More information

AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY

AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY Eugene Mikyung Kim Department of Music Technology, Korea National University of Arts eugene@u.northwestern.edu ABSTRACT

More information

Blueline, Linefree, Accuracy Ratio, & Moving Absolute Mean Ratio Charts

Blueline, Linefree, Accuracy Ratio, & Moving Absolute Mean Ratio Charts INTRODUCTION This instruction manual describes for users of the Excel Standard Celeration Template(s) the features of each page or worksheet in the template, allowing the user to set up and generate charts

More information

DETECTION OF SLOW-MOTION REPLAY SEGMENTS IN SPORTS VIDEO FOR HIGHLIGHTS GENERATION

DETECTION OF SLOW-MOTION REPLAY SEGMENTS IN SPORTS VIDEO FOR HIGHLIGHTS GENERATION DETECTION OF SLOW-MOTION REPLAY SEGMENTS IN SPORTS VIDEO FOR HIGHLIGHTS GENERATION H. Pan P. van Beek M. I. Sezan Electrical & Computer Engineering University of Illinois Urbana, IL 6182 Sharp Laboratories

More information

Machine Learning Term Project Write-up Creating Models of Performers of Chopin Mazurkas

Machine Learning Term Project Write-up Creating Models of Performers of Chopin Mazurkas Machine Learning Term Project Write-up Creating Models of Performers of Chopin Mazurkas Marcello Herreshoff In collaboration with Craig Sapp (craig@ccrma.stanford.edu) 1 Motivation We want to generative

More information

Computer-based sound spectrograph system

Computer-based sound spectrograph system Computer-based sound spectrograph system William J. Strong and E. Paul Palmer Department of Physics and Astronomy, Brigham Young University, Provo, Utah 84602 (Received 8 January 1975; revised 17 June

More information

homework solutions for: Homework #4: Signal-to-Noise Ratio Estimation submitted to: Dr. Joseph Picone ECE 8993 Fundamentals of Speech Recognition

homework solutions for: Homework #4: Signal-to-Noise Ratio Estimation submitted to: Dr. Joseph Picone ECE 8993 Fundamentals of Speech Recognition INSTITUTE FOR SIGNAL AND INFORMATION PROCESSING homework solutions for: Homework #4: Signal-to-Noise Ratio Estimation submitted to: Dr. Joseph Picone ECE 8993 Fundamentals of Speech Recognition May 3,

More information

Evaluation of Automatic Shot Boundary Detection on a Large Video Test Suite

Evaluation of Automatic Shot Boundary Detection on a Large Video Test Suite Evaluation of Automatic Shot Boundary Detection on a Large Video Test Suite Colin O Toole 1, Alan Smeaton 1, Noel Murphy 2 and Sean Marlow 2 School of Computer Applications 1 & School of Electronic Engineering

More information

EXPLORING THE USE OF ENF FOR MULTIMEDIA SYNCHRONIZATION

EXPLORING THE USE OF ENF FOR MULTIMEDIA SYNCHRONIZATION EXPLORING THE USE OF ENF FOR MULTIMEDIA SYNCHRONIZATION Hui Su, Adi Hajj-Ahmad, Min Wu, and Douglas W. Oard {hsu, adiha, minwu, oard}@umd.edu University of Maryland, College Park ABSTRACT The electric

More information

A Matlab toolbox for. Characterisation Of Recorded Underwater Sound (CHORUS) USER S GUIDE

A Matlab toolbox for. Characterisation Of Recorded Underwater Sound (CHORUS) USER S GUIDE Centre for Marine Science and Technology A Matlab toolbox for Characterisation Of Recorded Underwater Sound (CHORUS) USER S GUIDE Version 5.0b Prepared for: Centre for Marine Science and Technology Prepared

More information

AUTOMATIC ACCOMPANIMENT OF VOCAL MELODIES IN THE CONTEXT OF POPULAR MUSIC

AUTOMATIC ACCOMPANIMENT OF VOCAL MELODIES IN THE CONTEXT OF POPULAR MUSIC AUTOMATIC ACCOMPANIMENT OF VOCAL MELODIES IN THE CONTEXT OF POPULAR MUSIC A Thesis Presented to The Academic Faculty by Xiang Cao In Partial Fulfillment of the Requirements for the Degree Master of Science

More information