Speech Communication xxx (2005) xxx-xxx

Recognizing student emotions and attitudes on the basis of utterances in spoken tutoring dialogues with both human and computer tutors

Diane J. Litman a,*, Kate Forbes-Riley b
a University of Pittsburgh, Department of Computer Science and Learning Research and Development Center, Pittsburgh, PA 15260, USA
b University of Pittsburgh, Learning Research and Development Center, Pittsburgh, PA 15260, USA

Received 27 July 2004; received in revised form 13 September 2005; accepted 21 September 2005

Abstract

While human tutors respond to both what a student says and to how the student says it, most tutorial dialogue systems cannot detect the student emotions and attitudes underlying an utterance. We present an empirical study investigating the feasibility of recognizing student state in two corpora of spoken tutoring dialogues, one with a human tutor and one with a computer tutor. We first annotate student turns for negative, neutral and positive student states in both corpora. We then automatically extract acoustic-prosodic features from the student speech, and lexical items from the transcribed or recognized speech. We compare the results of machine learning experiments using these features alone, in combination, and with student- and task-dependent features, to predict student states. We also compare our results across human-human and human-computer spoken tutoring dialogues. Our results show significant improvements in prediction accuracy over relevant baselines, and provide a first step towards enhancing our intelligent tutoring spoken dialogue system to automatically recognize and adapt to student states.
© 2005 Elsevier B.V. All rights reserved.

Keywords: Emotional speech; Predicting user state via machine learning; Prosody; Empirical study relevant to adaptive spoken dialogue systems; Tutorial dialogue systems

1. Introduction

This paper investigates the automatic recognition of student emotions and attitudes in both human-human and human-computer spoken tutoring dialogues, on the basis of acoustic-prosodic and lexical information extractable from utterances. In recent years, the development of computational tutorial dialogue systems has become more and more prevalent (Aleven and Rose, 2003; Rose and Freedman, 2000; Rose and Aleven, 2002), as one method of attempting to close the current performance gap between human and computer tutors; recent experiments with such systems (e.g., Graesser et al., 2001b) are starting to yield promising empirical results.

* Corresponding author. E-mail addresses: litman@cs.pitt.edu (D.J. Litman), forbesk@pitt.edu (K. Forbes-Riley).

Motivated by connections between learning and student emotional state (Coles, 1999; Izard; Masters et al., 1979; Nasby and Yando; Potts et al., 1986; Seipp, 1991), another proposed method for closing the performance gap with human tutors has been to incorporate affective reasoning into computer tutoring systems, independently of whether or not the tutor is dialogue-based (Conati et al., 2003a; Kort et al., 2001; Bhatt et al., 2004). Recently, some preliminary results with computer tutors have been presented to support this line of research. Aist et al. (2002) have shown that adding human-provided emotional scaffolding to an automated reading tutor increases student persistence, while Craig and Graesser (2003) have found a significant relationship between students' confusion and learning during interactions with a mixed-initiative dialogue tutoring system.[1] Our long-term goal is to merge these lines of dialogue and affective tutoring research, by enhancing our intelligent tutoring spoken dialogue system to automatically recognize and adapt to student emotions and attitudes, and to investigate whether this improves learning and other measures of performance. The development of the adaptation component requires accurate emotion recognition; this paper presents results regarding this first step of our larger agenda: building an emotion recognition component.

Currently, most intelligent tutoring dialogue systems do not attempt to recognize student emotions and attitudes, and furthermore are text-based (Aleven et al., 2001; Evens et al., 2001; VanLehn et al.; Zinn et al., 2002), which may limit their success at emotion prediction. Speech supplies a rich source of information about a speaker's emotional state, and research in the area of emotional speech has already shown that acoustic and prosodic features can be extracted from the speech signal and used to develop predictive models of emotions (Cowie et al., 2001; ten Bosch, 2003; Pantic and Rothkrantz, 2003; Scherer, 2003). Much of this research has used databases of speech read by actors or native speakers as training data, often with semantically neutral content (Oudeyer, 2002; Polzin and Waibel, 1998; Liscombe et al., 2003). Although analyses of the acoustic-prosodic features associated with acted archetypal emotions support some correlations between specific features and emotions (e.g., lower average pitch and speaking rate for sad speech (ten Bosch, 2003)), these results generally transfer poorly to real applications (Cowie and Cornelius, 2003; Batliner et al., 2003). As a result, recent work motivated by spoken dialogue applications has started to use naturally occurring speech to train emotion predictors (Shafran et al., 2003; Batliner et al., 2003; Narayanan, 2002; Ang et al., 2002; Lee et al., 2002; Litman et al., 2001; Batliner et al., 2000; Lee et al., 2001; Devillers et al., 2003). However, within emotion research using naturally occurring data, both the range of emotions presented and the features that correlate with them have varied depending on the application domain (cf. Shafran et al., 2003; Narayanan, 2002; Ang et al., 2002; Lee et al., 2002; Batliner et al., 2000; Devillers et al., 2003).

Thus, more empirical work is needed to explore whether and how similar techniques can be used effectively to model student states in spoken dialogue tutoring systems. In addition, past research using naturally occurring speech has studied only human-human (Devillers et al., 2003), human-computer (Shafran et al., 2003; Lee et al., 2001; Lee et al., 2002; Narayanan, 2002; Ang et al., 2002), or wizard-of-oz (Batliner et al., 2000; Batliner et al., 2003; Narayanan, 2002) dialogue data. Just as previous work has demonstrated that results based on acted or read speech transfer poorly to spontaneous speech, more empirical work is needed to explore whether and how results regarding emotion prediction transfer across different types of naturally occurring spoken dialogue data, i.e. spoken dialogues between humans versus spoken dialogues between humans and computers, and/or spoken dialogues from different application domains.

In this paper, we examine the relative utility of the acoustic-prosodic and lexical information in student utterances, both with and without student- and task-dependent information, for recognizing student emotions and attitudes in spoken tutoring dialogues; we also examine the impact of using human transcriptions versus noisier system output for obtaining such information. Our methodology builds on and generalizes the results of prior research from the area of spoken dialogue, while applying them to the new domain of naturally occurring tutoring dialogues (in the domain of qualitative physics).

[1] We have also found a correlation between the ratio of negative/neutral student states and learning gains in our intelligent tutoring spoken dialogue data (to be described below), although these results are very preliminary.

Our work is also novel in replicating our analyses across two comparable spoken dialogue corpora: one with a computer tutor, and the other with a human tutor performing the same task as our computer system. Although these corpora were collected under comparable experimental conditions, they differ with respect to many characteristics, such as utterance length and student initiative. Given the current limitations of both speech and natural language processing technologies, computer tutors are far less flexible than human tutors, and also make more errors. The use of human tutors thus represents an ideal computer system, and thereby provides a benchmark for estimating the performance of our emotion recognition methods, at least with respect to speech and natural language processing performance.

In our experiments, we first annotate student turns in both of our spoken dialogue tutoring corpora for negative, neutral, and positive emotions and attitudes. We then create two datasets for each corpus: an Agreed dataset containing only those student turns whose annotations were originally agreed on by the annotators, and a Consensus dataset containing all annotated student turns, where original disagreements were given a consensus label. These datasets are summarized in Table 4. We then automatically extract acoustic-prosodic features from the speech signal of our annotated student turns, and lexical items from the transcribed or recognized speech, and perform a variety of machine learning experiments to predict our emotion categorizations using different feature set combinations. Overall, our results show that by using acoustic-prosodic features alone, or in combination with identifier features identifying specific subjects and tutoring problems, or in combination with lexical information, we can significantly improve over baseline (majority class) performance figures for emotion prediction. Our highest prediction accuracies are obtained by combining multiple feature types and by predicting only those annotated student turns that both annotators agreed on; these results are summarized later for our human-human corpus and, in Table 19, for our human-computer corpus. However, simpler models containing only a subset of features (or feature types) work comparably in many experiments, and these simpler models often have the advantage in terms of ease of implementation and/or domain-independence. While many of our observations generalized across the human-human and human-computer dialogues, we also find interesting differences between recognizing emotion in our two corpora, and also as compared to prior studies in other domains. In general, lexical features yielded higher predictive utility than acoustic-prosodic features. Within acoustic-prosodic features, there was a trend for temporal features to have the highest predictive utility, followed by energy features and lastly, pitch features. However, the usefulness of acoustic-prosodic features varied across experiments and corpora; indeed, across prior research as a whole, the usefulness of particular acoustic-prosodic features appears to be often domain-dependent.

Similarly, identifier features, whose use is limited to domains such as ours where there is a limited problem set and students reuse the tutoring system repeatedly, were found to have higher predictive utility in our human-computer corpus as compared to our human-human corpus. In sum, our recognition results provide an empirical basis for the next phase of our research, which will be to enhance our spoken dialogue tutoring system to automatically recognize and ultimately to adapt to student states.

Section 2 describes ITSPOKE, our intelligent tutoring spoken dialogue system, and the corpus it produces, as well as a human-human spoken tutoring corpus that corresponds to the human-computer corpus produced by ITSPOKE. Section 3 describes our annotation scheme for manually labeling student emotions and attitudes, and evaluates inter-annotator agreement when this scheme is used to annotate student states in dialogues from both our human-human and human-computer corpora. Section 4 discusses how acoustic and prosodic features available in real-time to ITSPOKE are computed from our dialogues. Section 5 then presents our machine learning experiments in automatic emotion recognition, analyzing the predictive performance of acoustic-prosodic features alone or in combination, both with and without subject- and task-dependent information. Section 6 investigates the impact of both adding a lexical feature representing the transcription of the student turn and, for the human-computer dialogues, using the noisy output of the speech recognizer rather than the actual transcription. Finally, Section 7 discusses related research, while Section 8 summarizes our results and describes our current and future directions.

2. Spoken dialogue tutoring corpora

2.1. Common aspects of the corpora

Our data for this paper come from spoken interactions between student and tutor, through which students learn to solve qualitative physics problems, i.e. thought-provoking "explain" or "why"-type physics problems that can be answered without doing any mathematics. We have collected two corpora of these spoken tutoring dialogues, which are distinguished according to whether the tutor is a human or a computer.

In these spoken tutoring corpora, dialogue interaction between student and tutor is mediated via a web interface, supplemented with a high-quality audio link. An example screenshot of this web interface, generated during an interaction between a student and the computer tutor, is shown in Fig. 1. The qualitative physics problem (problem 58) is shown in the upper right box. The student begins by typing an essay answer to this problem in the middle right box. When finished with the essay, the student clicks the SUBMIT button.[2] The tutor then analyzes the essay and engages the student in a spoken natural language dialogue to provide feedback and correct misconceptions in the essay, and to elicit more complete explanations. The middle left box in Fig. 1 is used during human-computer tutoring to record the dialogue history. This box remains empty during human-human tutoring, because both the student and tutor utterances would require manual transcription before they could be displayed. In the human-computer tutoring, in contrast, the speech recognition and speech synthesis components of the computer tutor can be used to provide the transcriptions. After the dialogue between tutor and student is completed, the student revises the essay, thereby ending the tutoring for that physics problem or causing another round of tutoring/essay revision.

The experimental procedure for collecting both our spoken tutoring corpora is as follows[3]: (1) students are given a pre-test measuring their knowledge of physics, (2) students are asked to read through a small document of background material,[4] (3) students use the web and voice interface to work through a set of up to 10 training physics problems with the (human or computer) tutor, and (4) students are given a post-test that is similar to the pre-test. The experiment typically takes no more than 7 h per student, and is performed in 1-2 sessions. Students are University of Pittsburgh students who have never taken a college-level physics course, and who are native speakers of American English.

2.2. The human-human spoken dialogue tutoring corpus

Our human-human spoken dialogue tutoring corpus contains 128 transcribed dialogues (physics problems) from 14 different students, collected beginning in Fall 2002. One human tutor participated. The student and the human tutor were separated by a partition, and spoke to each other through head-mounted microphones. Each participant's speech was digitally recorded on a separate channel. Transcription and turn-segmentation of the student and tutor speech were then done by a paid transcriber. The transcriber added a turn boundary when: (1) the speaker stopped speaking and the other party in the dialogue began to speak, (2) the speaker asked a question and stopped speaking to wait for an answer, or (3) the other party in the dialogue interrupted the speaker and the speaker paused to allow the other party to speak. An emotion-annotated (Section 3) excerpt from our human-human tutoring corpus is shown in Fig. 2. In the human-human corpus, interruptions and overlapping speech are common; turns ending in "-" (as in TUTOR 6, Fig. 2) indicate when speech overlaps with the following turn, and other punctuation has been added to the transcriptions for readability.

2.3. The human-computer spoken dialogue tutoring corpus

Our human-computer spoken dialogue tutoring corpus contains 100 dialogues (physics problems) from 20 students, collected beginning in Fall 2003.

[2] The Tell Tutor box is used for typed student login and logout.
[3] Our spoken tutoring corpora were collected as part of a wider evaluation comparing student learning across speech-based and text-based human-human and human-computer tutoring conditions (Litman et al., 2004).
[4] In the computer tutoring experiment, the pre-test was moved to after the background reading, to allow us to measure learning gains caused by the experimental manipulation without confusing them with gains caused by the background reading.

Fig. 1. Screenshot during human-computer spoken tutoring dialogue.
Fig. 2. Annotated excerpt from human-human spoken tutoring corpus.

Our computer tutor is called ITSPOKE (Intelligent Tutoring SPOKEn dialogue system) (Litman and Silliman, 2004). ITSPOKE uses as its back-end the text-based Why2-Atlas dialogue tutoring system (VanLehn et al., 2002), which handles syntactic and semantic analysis (Rosé, 2000), discourse and domain level processing (Jordan and VanLehn, 2002; Jordan et al., 2003), and finite-state dialogue management (Rosé et al., 2001).

To analyze the typed student essay, the Why2-Atlas back-end first parses the student essay into propositional representations, in order to find useful dialogue topics. It uses three different approaches (symbolic, statistical and hybrid) competitively to create a representation for each sentence, then resolves temporal and nominal anaphora and constructs proofs using abductive reasoning (Jordan et al., 2004).

During the subsequent dialogue, student speech is digitally recorded from head-mounted microphone input. Barge-ins and overlaps are not currently permitted.[5] The student speech is sent to the Sphinx2 speech recognizer (Huang et al., 1993), whose stochastic language models have a vocabulary of 1240 words and are trained with 7720 student utterances from evaluations of Why2-Atlas and from pilot studies of ITSPOKE. Transcription (speech recognition) and turn-segmentation is done automatically in ITSPOKE. However, because speech recognition is imperfect, the human-computer data is also manually transcribed, for comparison. Sphinx2's most probable transcription (recognition output) is sent to the Why2-Atlas back-end for natural language understanding. The dialogue is managed by a finite-state dialogue manager, where nodes correspond to tutor turns, and arcs to student turns. Why2-Atlas' natural language understanding (NLU) component associates a semantic grammar with each tutor question (i.e., with each node in the dialogue finite-state machine); grammars across questions may share rules. The categories in the grammar correspond to the expected responses for the question (i.e., to the arcs exiting the question node in the finite-state machine), and represent both correct answers and typical student misconceptions (VanLehn et al., 2002).

Fig. 3. Annotated excerpt from human-computer spoken tutoring corpus.

[5] Although not yet evaluated, our next version of ITSPOKE supports barge-in, and thus allows the student to interrupt ITSPOKE when it is speaking, e.g., when it is giving a long explanation.
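The grammar-plus-finite-state design just described can be made concrete with a small sketch. The following is only a toy illustration under assumed simplifications, not the Why2-Atlas implementation: the node names, the non-"downward" grammar entries, and the substring matching are hypothetical, and real semantic grammars are far richer than keyword lists.

```python
# Toy illustration of the design described above: each tutor question is a node in a
# finite-state machine, a semantic grammar maps expected concepts to surface variants,
# and the concepts found in the recognized student turn select the outgoing arc.
# Node names, the "upward" entries, and the matching logic are hypothetical.

GRAMMAR = {
    "direction-question": {
        "downward": ["down", "downward", "downwards", "towards earth", "is it downwards"],
        "upward": ["up", "upward", "upwards", "away from earth"],
    },
}

TRANSITIONS = {
    "direction-question": {
        "downward": "ask-about-magnitude",   # expected correct answer
        "upward": "remediate-direction",     # typical misconception
        None: "rejection-prompt",            # nothing expected was recognized
    },
}

def nlu(node, recognized):
    """Return the subset of expected concepts found in the recognized utterance."""
    text = recognized.lower()
    return {concept for concept, phrases in GRAMMAR.get(node, {}).items()
            if any(phrase in text for phrase in phrases)}

def next_node(node, recognized):
    """Follow the arc for a recognized concept, else fall back to a rejection prompt."""
    arcs = TRANSITIONS[node]
    for concept in nlu(node, recognized):
        if concept in arcs:
            return arcs[concept]
    return arcs[None]

print(nlu("direction-question", "um I think it is downwards"))  # {'downward'}
print(next_node("direction-question", "towards earth"))         # ask-about-magnitude
```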

Given a student's utterance, the output of the NLU component is thus a subset of the semantic concepts that were expected as answers to the tutor's prior question, and that were found when parsing the student's utterance. For instance, the semantic concept downward is used in many of the semantic grammars, and would be the semantic output for a variety of utterances such as "downwards", "towards earth", "is it downwards", "down", etc.

The text response produced by Why2-Atlas (i.e., the next node in the finite-state machine) is then sent to the Cepstral text-to-speech system[6] and played to the student through the headphones. After each system prompt or student utterance is spoken, the system prompt, or the system's understanding of the student's response (i.e., the output of the speech recognizer), respectively, is added to the dialogue history. At the time the screenshot in Fig. 1 was generated, for example, the student had just said "free fall" (in this case the utterance was correctly recognized).

An emotion-annotated (Section 3) dialogue excerpt from our human-computer corpus is shown in Fig. 3. The excerpt shows both what the student said and what ITSPOKE recognized (the ASR annotations). As shown, the output of the automatic speech recognizer sometimes differed from what the student actually said. When ITSPOKE was not confident of what it thought the student said, it generated a rejection prompt and asked the student to repeat. ITSPOKE also misrecognized utterances: when ITSPOKE heard something different than what the student said (as with the last student turn) but was confident in its hypothesis, it proceeded as if it had heard correctly. While the ITSPOKE word error rate in this corpus was 31.2%, natural language understanding based on speech recognition (i.e., the recognition of semantic concepts instead of actual words) is the same as that based on perfect transcription 92.4% of the time.[7] The accuracy of recognizing semantic concepts is more relevant for dialogue evaluation, as it does not penalize for word errors that are unimportant to overall utterance interpretation.

3. An annotation scheme for student emotion and attitude

3.1. Emotion classes

In our data, student emotions[8] can only be identified indirectly: via what is said and/or how it is said. However, such evidence is not always obvious, unambiguous, or consistent. For example, a student may express anger through the use of swear words, or through a particular tone of voice, or via a combination of signals, or not at all. Moreover, another student may present some of these same signals even when s/he does not feel anger. In (Litman and Forbes-Riley, 2004a), we present a coding scheme for manually annotating the student turns[9] in our spoken tutoring dialogues for intuitively perceived expression of emotion. In this scheme, expressions of emotion[10] are viewed along a linear scale, shown and defined as follows[11]:

negative - neutral - positive

[6] The Cepstral system is a commercial outgrowth of the Festival system (Black and Taylor, 1997).
[7] An internal evaluation of this semantic analysis component in an early version of the Why2-Atlas system (with its typed input, and thus perfect transcription) yielded 97% accuracy (Rose, 2005).
[8] In the rest of this paper, we will use the term emotion loosely, to cover both affects and attitudes that can impact student learning. Although some argue that emotion should be distinguished from attitude, some speech researchers have found that the narrow sense of emotion is too restrictive because it excludes states in speech where emotion is present but not full-blown, including arousal and attitude (Cowie and Cornelius, 2003). Some tutoring researchers have also found it useful to take a combined view of affect and attitude (Bhatt et al., 2004).
[9] We use the terms turn and utterance interchangeably in this paper.
[10] Although an expression of emotion is not interchangeable with the emotion itself (Russell et al., 2003), our use of the term emotion hereafter should be understood as referring (when appropriate) to the annotated expression of emotion.
[11] In (Litman and Forbes-Riley, 2004a), we have also explored separately annotating strong, weak and mixed emotions, as well as annotating specific emotions such as uncertain, irritated, confident; complete details of our annotation studies are described therein.
[12] These negative, neutral and positive emotion classes correspond to traditional notions of valence (cf. Cowie and Cornelius, 2003), but these terms are not related to the impact of emotion on learning. For example, in work that draws on a disequilibrium theory of the relationship between emotion and learning, working through negative emotions is believed to be a necessary part of the learning process (Craig and Graesser, 2003).

Negative[12]: a student turn that expresses emotions such as confused, bored, irritated, uncertain, sad. Examples of negative student turns in our human-human and human-computer corpora are found in Figs. 2 and 3. Evidence[13] of a negative emotion can come from lexical expressions of uncertainty, e.g., the phrase "I don't know", a syntactic question, or disfluencies, as well as from acoustic and prosodic features, including pausing, pitch and energy variation. For example, the negative student turn, student 5, in Fig. 2 contains the phrase "I don't know why", as well as frequent internal pausing and a wide pitch variation.[14] The negative student turn, student 19, in Fig. 3 displays a slow tempo and rising intonation.

Positive: a student turn that expresses emotions such as confident, enthusiastic. For example, a student turn in Fig. 2 is labeled positive; evidence of a positive emotion in this case comes from lexical expressions of certainty, e.g., "It's the...", as well as from acoustic and prosodic features, including loud speech and a fast tempo. The positive student turn, student 21, in Fig. 3 displays a fast tempo with very little pausing preceding the utterance.

Neutral: a student turn that does not express a positive or negative emotion. Examples of neutral student turns are student 8 in Fig. 2 and student 22 in Fig. 3. Acoustic and prosodic features, including moderate loudness, tempo, and inflection, give evidence for these neutral labels, as does the lack of semantic content in the grounding phrase "mm-hm".

Emotion annotations were performed from both audio and transcription using the sound visualization and manipulation tool Wavesurfer.[15] The emotion annotators were instructed to try to annotate emotion relative to both context and task. By context-relative we mean that a student turn in our tutoring dialogues is identified as expressing emotion relative to the other student turns in that dialogue. By task-relative we mean that a student turn perceived during tutoring as expressing an emotion might not be perceived as expressing the same emotion with the same strength in another (e.g., non-tutoring) situation. Moreover, the range of emotions that arise during tutoring might not be the same as the range of emotions that arise during some other task. For example, consider the context of a tutoring session, where a student has been answering tutor questions with apparent ease. If the tutor then asks another question, and the student responds slowly, saying "Um, now I'm confused", this turn would likely be labeled negative. However, in the context of a heated argument between two people, this same turn might be labeled as a weak negative, or even a weak positive. Litman and Forbes-Riley (2004a) provides full details of our annotation scheme, including discussion of our coding manual and annotation tool, while Section 7 compares our scheme to related work.

3.2. Quantifying inter-annotator agreement

We conducted a study for each corpus, to quantify the degree of agreement between two coders (the authors) in classifying utterances using our annotation scheme. To analyze agreement in our human-human spoken tutoring corpus (Section 2.2), we randomly selected 10 transcribed dialogues from 9 subjects, yielding a dataset of 453 student turns, where approximately 40 turns came from each of the 9 subjects. The 453 turns were separately annotated by the two authors, using the emotion annotation scheme described above. To analyze agreement in our human-computer corpus (Section 2.3), we randomly selected 15 transcribed dialogues from 10 subjects, yielding a dataset of 333 student turns, where approximately 30 turns came from each of the 10 subjects. Each turn was again separately annotated by the two authors.

Two confusion matrices summarizing the resulting agreement between the two emotion annotators for each corpus are shown in Tables 1 and 2. The rows correspond to the labels assigned by annotator 1, and the columns correspond to the labels assigned by annotator 2.

Table 1. Confusion matrix for human-human corpus annotation (rows: negative, neutral and positive labels assigned by annotator 1; columns: the same labels assigned by annotator 2).
Table 2. Confusion matrix for human-computer corpus annotation (rows: negative, neutral and positive labels assigned by annotator 1; columns: the same labels assigned by annotator 2).

[13] As determined by post-annotation discussion (see Section 7).
[14] As illustrated by the hyperlinks in Figs. 2 and 3, annotators could also listen to the recording of the dialogue, as detailed below.
[15] The tool is shown in Figs. 5 and 6.

For example, in Table 1, 112 negatives were agreed upon by both annotators, while the remaining negatives assigned by annotator 1 were labeled by annotator 2 as neutral or, in 9 cases, as positive. Note that across both corpora, annotator 2 consistently annotates more positive and fewer neutral turns than annotator 1.

As shown along the diagonal of Table 1, the two annotators agreed on the annotations of 340/453 student turns in the human-human tutoring data, achieving 75.1% agreement (Kappa = 0.6, α = 0.6).[16] As shown along the diagonal of Table 2, the two annotators agreed on the annotations of 202/333 student turns in the human-computer tutoring data, achieving 60.7% agreement (Kappa = 0.4, α = 0.4).[17] It has generally been found to be difficult to achieve levels of inter-annotator agreement above "Moderate" (see footnote 16) for emotion annotation in naturally occurring dialogues. Ang et al. (2002), for example, report inter-annotator agreement of 71% (Kappa 0.47), while Shafran et al. (2003) report Kappas ranging upward from 0.32. Such studies were nevertheless able to use acoustic-prosodic cues to effectively distinguish these annotator judgments of emotion.

A number of researchers have accommodated low inter-annotator agreement for emotion annotation by exploring ways of achieving consensus between disagreed annotations. Following Ang et al. (2002) and Devillers et al. (2003), we explored consensus labeling, both with the goal of increasing our usable dataset for prediction, and to include the more difficult annotation cases. For our consensus labeling, the original annotators revisited each originally disagreed case and, through discussion, sought a consensus label. Due to consensus labeling, agreement rose in both our human-human and human-computer data to 100%.[18] A summary of the distribution of emotion labels after consensus labeling is shown in Table 3.

Table 3. Consensus labeling over emotion-annotated data (distribution of negative, neutral and positive labels in the human-human and human-computer corpora).

As in (Ang et al., 2002), we will experiment with predicting emotions in Section 5 using both our agreed data and our consensus-labeled data.[19] Table 4 summarizes the characteristics of the emotion-annotated subsets of both our human and computer tutoring corpora, with respect to both the agreed and consensus emotion labels.

As a final note, during the annotation and subsequent consensus discussions, we observed that the human-human and human-computer dialogues differ with respect to a variety of characteristics.

[16] Kappa and α are metrics for computing the pairwise agreement among annotators making category judgments. Kappa (Carletta, 1996; Siegel et al., 1988; Cohen, 1960) is computed as (P(A) - P(E)) / (1 - P(E)), where P(A) is the proportion of actual agreement among annotators, and P(E) is the proportion of agreement expected by chance. α (Krippendorf, 1980) is computed as 1 - D(O)/D(E), where D(O) is the proportion of observed disagreement between annotators and D(E) is the proportion of disagreement expected by chance. When there is no agreement other than that expected by chance, Kappa and α = 0; when there is total agreement, Kappa and α = 1. Krippendorf's (1980) α and Siegel et al.'s (1988) version of Kappa are nearly identical; however, these two metrics use a different method of estimating the probability distribution for chance than does Cohen's (1960) version of Kappa (DiEugenio and Glass, 2004), which is used in this paper. Although interpreting the strength of inter-annotator agreement is controversial (DiEugenio and Glass, 2004), Landis and Koch (1977) and others use the following standard for Kappa: 0.21-0.40, "Fair"; 0.41-0.60, "Moderate"; 0.61-0.80, "Substantial"; 0.81-1.00, "Almost Perfect". Krippendorf (1980) uses the following stricter standard for α: α < .67, "cannot draw conclusions"; .67 ≤ α ≤ .8, "allows tentative conclusions"; α > .8, "allows definite conclusions". Although neither metric is ideal for this study because both assume independent events, unlike other measures of agreement such as percent agreement, Kappa and α take into account the inherent complexity of a task by correcting for chance expected agreement.
[17] Since our emotion categories are ordinal/interval rather than nominal, we can also quantify agreement using a weighted version of Kappa (Cohen, 1968), which accounts for the relative distances between successive categories. With (quadratic) weighting, our Kappa values increase to .7 and .5 for the human-human and human-computer annotations, respectively. Similarly, using an interval version of α (Krippendorf, 1980) that also accounts for a relative distance (of 1) between categories, the α values increase to .7 and .5 for the human-human and human-computer annotations, respectively.
[18] There were eight student turns in the human-human corpus for which the annotators had difficulty deciding upon a consensus label; these cases were given the neutral consensus label as a result.
[19] Although not discussed in this paper, we have also run prediction experiments using each individual annotator's labeled data; the results in each case were lower than those for the agreed data, and were approximately the same as the results for the consensus-labeled data, as discussed below.
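Agreement figures of this kind can be computed directly from an annotator confusion matrix. The following minimal sketch (Python with NumPy, not the paper's analysis code) implements percent agreement, the unweighted Kappa defined in footnote 16, and the quadratic-weighted Kappa mentioned in footnote 17; the matrix below is a hypothetical example rather than the counts of Table 1 or 2.

```python
import numpy as np

def agreement_stats(confusion, quadratic=False):
    """Percent agreement and (optionally quadratic-weighted) Kappa for a k x k
    confusion matrix: rows = annotator 1's labels, columns = annotator 2's."""
    c = np.asarray(confusion, dtype=float)
    n = c.sum()
    observed = c / n                                            # observed proportions
    expected = np.outer(c.sum(axis=1), c.sum(axis=0)) / n ** 2  # chance proportions

    if quadratic:                       # weighted Kappa (cf. footnote 17)
        k = c.shape[0]
        i, j = np.indices((k, k))
        w = (i - j) ** 2 / (k - 1) ** 2                  # quadratic disagreement weights
        kappa = 1 - (w * observed).sum() / (w * expected).sum()
    else:                               # Kappa = (P(A) - P(E)) / (1 - P(E)), footnote 16
        p_a = np.trace(observed)        # P(A): proportion of actual agreement
        p_e = np.trace(expected)        # P(E): proportion expected by chance
        kappa = (p_a - p_e) / (1 - p_e)
    return np.trace(observed), kappa

# Hypothetical 3x3 matrix over (negative, neutral, positive); NOT the counts of Table 1.
C = [[40, 12, 3],
     [10, 65, 8],
     [ 2, 11, 29]]
print(agreement_stats(C))                  # (percent agreement, unweighted Kappa)
print(agreement_stats(C, quadratic=True))  # (percent agreement, quadratic-weighted Kappa)
```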

Table 4. Summary of emotion-annotated data for the human-human and human-computer corpora, under both the Agreed and Consensus labelings: number of students, dialogues, student turns, student words and unique student words, minutes of student speech, and the majority class (neutral), which is 53% (Agreed) and 60% (Consensus) for the human-human data and 47% (Agreed) and 48% (Consensus) for the human-computer data.

Many of these differences are illustrated in the corpus excerpts above, and in part reflect the fact that our computer tutor is far less robust than our human tutor with respect to its interactiveness and understanding capabilities. Such differences can potentially impact both the emotional state of the student and how the student is able to express an emotional state. We hypothesize that such differences may also have impacted the comparative difficulty of annotating emotion in the two corpora. For example, the average student turn length in the 10 annotated human-human dialogues is longer than the average turn length of 2.52 words in the 15 human-computer dialogues. The fact that students speak less in the human-computer dialogues means that there is less information to make use of when judging expressed emotions. We also observed that in the human-human dialogues there are more student initiatives and groundings, as well as references to prior problems. The limitations of the computer tutor may thus have restricted how students expressed themselves (including how they expressed their own emotional states) in other ways besides word quantity. Finally, the fact that the computer tutor made processing errors may have impacted both the types and quantity of student emotional states. As shown in Table 3, there is a higher proportion of negative emotions in the human-computer corpus as compared to the human-human corpus (38% versus 26%, respectively). As we will see with our machine learning experiments in Section 5, emotion prediction is also more difficult in the human-computer corpus, which may again in part reflect its differing dialogue characteristics that arise from the limitations of the computer tutor.

4. Extracting features from the speech signal of student turns

4.1. Acoustic-prosodic features

For each of the emotion-annotated student turns in the human-human and human-computer corpora, we computed the 12 acoustic and prosodic features itemized in Fig. 4, for use in the machine learning experiments described in Section 5.

Fig. 4. Twelve acoustic-prosodic features per student turn.
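As a compact reference, the 12 features described in Section 4.1 can be grouped as in the following sketch; the grouping follows the running text and is not copied from Fig. 4.

```python
# The 12 per-turn acoustic-prosodic features described in Section 4.1, grouped by type.
# Names follow the running text; the grouping is a reading aid, not a copy of Fig. 4.
ACOUSTIC_PROSODIC_FEATURES = {
    "pitch (f0)":   ["maxf0", "minf0", "meanf0", "stdf0"],
    "energy (RMS)": ["maxrms", "minrms", "meanrms", "stdrms"],
    "temporal":     ["duration", "prepause", "tempo", "intsilence"],
}
```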

Motivated by previous studies of emotion prediction in spontaneous dialogues in other domains (Ang et al., 2002; Lee et al., 2001; Batliner et al., 2003), our acoustic-prosodic features represent knowledge of pitch, energy, duration, tempo and pausing. Aspects of silence and pausing have been shown to be relevant for categorizing other aspects of student behavior in tutoring dialogues as well (Fox, 1993; Shah et al., 2002). We focus on acoustic and prosodic features of individual turns that can be computed automatically from the speech signal and are available in real time to ITSPOKE, since our long-term goal is to use these features to trigger online adaptation in ITSPOKE based on predicted student emotions.[20]

F0 and RMS values, representing measures of pitch excursion and loudness, respectively, were computed using Entropic Research Laboratory's pitch tracker, get_f0,[21] with no post-correction. A pitch tracker takes as input a speech file, and outputs a fundamental frequency (f0) contour (the physical correlate of pitch). In Fig. 5, for example, the Pitch Pane displays the f0 contour for the experimentally obtained speech file shown in the Student Speech pane, where the x-axis represents time and the y-axis represents frequency in Hz.[22] Each f0 value corresponds to a frame step of 0.01 s across the student turn "free fall?"; the rising f0 contour is typical of a question. Our features maxf0 and minf0 correspond to the highest and lowest f0 values (the peaks and valleys) in the f0 contour, while meanf0 and stdf0 are based on averaging over all the (non-zero) f0 values in the contour, which are given by get_f0 in frame steps of 0.01 s.

Energy can alternatively be represented in terms of decibels (dB) or root mean squared amplitude (rms). For example, the Energy Pane at the bottom of Fig. 5 displays energy values computed in decibels across frame steps of 0.01 s for the student speech shown in the Student Speech pane, where the x-axis represents time and the y-axis represents decibels. The variation in energy values across this student turn reflects the fact that the student's utterance itself is much louder than the silences before and after (although, as can be seen, the analysis picks up some minor background noise when the student is not speaking).

Fig. 5. Computing pitch and energy-based features.

[20] In preliminary experiments for this paper, and also in previous work (e.g., Litman and Forbes, 2003), we also investigated the use of two normalized versions of our acoustic-prosodic features, specifically, features normalized by either prior turn or by first turn. These normalizations have the benefit of removing the gender dependency of f0 features. However, we have consistently found little difference in predictive utility for raw versus normalized features, in both our human-human and human-computer data, so we use only raw (non-normalized) feature values here. As discussed below, however, we experiment with the use of gender as an explicit feature.
[21] get_f0 and other Entropic software is currently available free of charge.
[22] The representations in Figs. 5 and 6 use the Wavesurfer sound visualization and manipulation tool.
The get_f0 pitch tracker used in this study computes energy as rms values based on a 0.03-s window within frame steps of 0.01 s. maxrms and minrms correspond to the highest and lowest rms values over all the frames in a student turn, while meanrms and stdrms are based on averaging over all the rms values in the frames in a student turn.

Our four temporal features were computed from the turn boundaries of the transcribed speech. Recall that during our corpora collection, student and tutor speech are digitally recorded separately, yielding a 2-channel speech file for each dialogue. In Fig. 6, the Tutor Speech and Student Speech panes show a portion of the tutor and student speech files, while the Tutor Text and Student Text panes show the associated transcriptions. The vertical lines around each tutor and student utterance correspond to the turn segmentations. For example, the leftmost vertical line indicates that the tutor's turn "what is that motion?" begins approximately 30.45 s (30,450 ms) into the dialogue. Recall that in our human-human dialogues these turn boundaries are manually labeled by our paid transcriber. In our human-computer dialogues, tutor turn boundaries correspond to the beginning and end times of the speech synthesis process, while student turn boundaries correspond to the beginning and end times of the student speech as detected by the speech recognizer, and thus are a noisy estimate of the actual student turn boundaries.[23]

The duration of each student turn was calculated by subtracting the turn's beginning time from its ending time. In Fig. 6, for example, the duration of the student's turn is approximately 0.90 s.

[23] While we manually transcribed the lexical information to quantify the error due to speech recognition, we did not manually relabel turn boundaries; we thus cannot quantify the level of noise introduced by automatic turn segmentation.
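The statistics above were produced with Entropic's get_f0, which is not assumed below. As a rough illustration only, the following sketch computes analogous per-turn pitch, energy and duration features with the open-source librosa library and a 0.01-s frame step; the file name and turn times are hypothetical, and the values will not exactly match get_f0's.

```python
import numpy as np
import librosa

def pitch_energy_duration(wav_path, turn_start, turn_end):
    """Per-turn f0 and RMS statistics plus duration, roughly analogous to the
    get_f0-based features of Section 4.1 (not a re-implementation of get_f0).
    Assumes the turn contains at least some voiced speech."""
    y, sr = librosa.load(wav_path, sr=None)
    turn = y[int(turn_start * sr):int(turn_end * sr)]        # slice out the student turn
    hop = int(0.01 * sr)                                     # 0.01-s frame step

    # f0 contour; unvoiced frames are returned as NaN and excluded from the statistics
    f0, voiced_flag, _ = librosa.pyin(turn, fmin=75, fmax=500, sr=sr, hop_length=hop)
    voiced_f0 = f0[~np.isnan(f0)]

    # RMS energy per frame, using a 0.03-s window as in the description of get_f0 above
    rms = librosa.feature.rms(y=turn, frame_length=int(0.03 * sr), hop_length=hop)[0]

    return {
        "maxf0": float(np.max(voiced_f0)), "minf0": float(np.min(voiced_f0)),
        "meanf0": float(np.mean(voiced_f0)), "stdf0": float(np.std(voiced_f0)),
        "maxrms": float(np.max(rms)), "minrms": float(np.min(rms)),
        "meanrms": float(np.mean(rms)), "stdrms": float(np.std(rms)),
        "duration": turn_end - turn_start,        # ending time minus beginning time
    }

# Hypothetical usage: a student turn from 31.35 s to 32.25 s on the student channel.
# features = pitch_energy_duration("student_channel.wav", 31.35, 32.25)
```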

The preceding pause (prepause) before a student turn began was calculated by subtracting the ending time of the tutor's (prior) turn from the beginning time of the student's turn. In Fig. 6, for example, the duration of the pause preceding the student's turn is 0.85 s.[24]

The speaking rate (tempo) was calculated as syllables per second in the turn, where the number of syllables in the transcription was computed using the Festival text-to-speech OALD dictionary, and the turn duration was computed as above.[25] For example, in Fig. 6 there are five syllables in the student turn ("the freefall motion"), and the duration of the student turn is 0.90 s (as computed above); thus the speaking rate in the turn is 5.56 syllables/second. In this paper, we computed tempo in the human-computer dialogues based on the human transcription of the student turns. Although this more closely reflects the actual tempo than the noisier tempo computed on the automatic speech recognition output, Ang et al. (2002) compared machine learning experiments using features such as tempo computed both on the human transcription and on the automatically recognized speech, and found that the prediction results were comparable.

Amount of silence (intsilence) was defined as the percentage of frames in the turn where the probability of voicing = 0; this probability is available from the output of the get_f0 pitch tracker, and the resulting percentage represents roughly the percentage of time within the turn that the student was silent.[26] For example, the student turn in Fig. 6 has approximately 31% internal silence.

Fig. 6. Computing temporal features.

[24] Note that in the human-human corpus, if a student turn began before the prior tutor turn ended (i.e., student barge-ins and overlaps), the preceding pause feature for that turn was 0. If a student turn initiated a dialogue or was preceded by a student turn (rather than a tutor turn), its preceding pause feature was not defined for that turn. In the human-computer corpus, every student turn is preceded by a tutor turn.
[25] Note that this method calculates only a single (average) tempo for the turn, because we were not sampling the tempo at subintervals throughout the turn.
[26] Using the percentage of unvoiced frames as a measure of silence will overestimate the amount of silence in the turn, because, e.g., long unvoiced fricatives will be included; however, it has been used in previous work as a rough estimate of internal silence (Litman et al., 2001). In our data, energy was rarely zero across the individual frames per turn, and thus was not a better estimate of internal silence.
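As a companion to the pitch and energy sketch above, the following hedged sketch computes the four temporal features from turn boundaries, a transcription and a per-frame voicing mask. The count_syllables heuristic merely stands in for the Festival OALD dictionary lookup used here, and the numbers in the usage line are hypothetical values chosen to echo the Fig. 6 walk-through.

```python
import numpy as np

def count_syllables(word):
    """Crude vowel-group heuristic standing in for the Festival OALD dictionary lookup."""
    groups, prev_vowel = 0, False
    for ch in word.lower():
        is_vowel = ch in "aeiouy"
        if is_vowel and not prev_vowel:
            groups += 1
        prev_vowel = is_vowel
    return max(groups, 1)

def temporal_features(turn_start, turn_end, prior_tutor_end, transcript, voiced_flag):
    """duration, prepause, tempo and intsilence for one student turn; voiced_flag is a
    per-frame boolean voicing mask, e.g. from a pitch tracker."""
    duration = turn_end - turn_start
    prepause = max(turn_start - prior_tutor_end, 0.0)        # 0 for barge-ins/overlaps
    syllables = sum(count_syllables(w) for w in transcript.split())
    tempo = syllables / duration                             # syllables per second
    intsilence = 100.0 * np.mean(~np.asarray(voiced_flag))   # % of unvoiced frames
    return {"duration": duration, "prepause": prepause,
            "tempo": tempo, "intsilence": intsilence}

# Hypothetical numbers echoing the Fig. 6 walk-through: a 0.90-s turn with a 0.85-s
# preceding pause and the transcript "the freefall motion" (five syllables).
print(temporal_features(31.35, 32.25, 30.50, "the freefall motion",
                        voiced_flag=[True] * 62 + [False] * 28))
```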

4.2. Adding identifier features representing the student and problem

Finally, we also recorded for each turn the 3 identifier features shown in Fig. 7, all of which are automatically available in ITSPOKE through student login. Prior studies (Oudeyer, 2002; Lee et al., 2002) have shown that subject and gender features can play an important role in emotion recognition, because different genders and different speakers can convey emotions differently. Subject ID and problem ID are uniquely important in our tutoring domain because, in contrast to, e.g., call centers, where most callers are distinct, students will use our system repeatedly, and problems are repeated across students.[27]

Fig. 7. Three identifier features per student turn.

5. Predicting student emotion from acoustic-prosodic features

We next performed machine learning experiments with acoustic-prosodic features and our emotion-annotated student turns, to explore how well the 12 acoustic-prosodic features discussed in Section 4 predict the emotion labels in both our human-human and human-computer tutoring corpora. We explore the predictions for our originally agreed emotion labels and for our consensus emotion labels (Section 3.2). Using originally agreed data is expected to produce better results, since presumably annotators originally agreed on cases that provide more clear-cut prosodic information about emotional features (Ang et al., 2002), but using consensus data is worthwhile because it includes the less clear-cut data that the computer will actually encounter.

For these experiments, we use a boosting algorithm in the Weka machine learning software (Witten and Frank, 1999). In general, the boosting algorithm, called AdaBoostM1 in Weka, enables the accuracy of a weak learning algorithm to be improved by repeatedly applying that algorithm to different distributions or weightings of the training examples, each time generating a new weak prediction rule, and eventually combining all the weak prediction rules.

[27] In preliminary experiments for this paper, we examined the use of only subject ID, as well as of only subject ID and gender, with the view that these identifier features generalize to other domains besides physics. Overall we found that these two subsets produced results that were the same as including the problem ID feature.
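The experiments described here use Weka's AdaBoostM1. Purely as an illustration of the boosting-over-weak-learners idea, and of the majority-class baseline used throughout, the following sketch uses scikit-learn's AdaBoostClassifier (whose default weak learner is a depth-1 decision tree) as a rough analogue; the feature matrix, labels and parameter values are hypothetical stand-ins, not the actual data or experimental setup.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in data: one row of 12 acoustic-prosodic features per student turn
# and one emotion label per turn (453 turns, as in the annotated human-human subset).
rng = np.random.default_rng(0)
X = rng.normal(size=(453, 12))
y = rng.choice(["negative", "neutral", "positive"], size=453)

# Boosted weak learners (default: depth-1 decision trees), a rough analogue of
# Weka's AdaBoostM1; the number of estimators here is illustrative.
booster = AdaBoostClassifier(n_estimators=50, random_state=0)
baseline = DummyClassifier(strategy="most_frequent")   # majority-class baseline

print("boosting accuracy:", cross_val_score(booster, X, y, cv=10).mean())
print("baseline accuracy:", cross_val_score(baseline, X, y, cv=10).mean())
```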

About Giovanni De Poli. What is Model. Introduction. di Poli: Methodologies for Expressive Modeling of/for Music Performance Methodologies for Expressiveness Modeling of and for Music Performance by Giovanni De Poli Center of Computational Sonology, Department of Information Engineering, University of Padova, Padova, Italy About

More information

On time: the influence of tempo, structure and style on the timing of grace notes in skilled musical performance

On time: the influence of tempo, structure and style on the timing of grace notes in skilled musical performance RHYTHM IN MUSIC PERFORMANCE AND PERCEIVED STRUCTURE 1 On time: the influence of tempo, structure and style on the timing of grace notes in skilled musical performance W. Luke Windsor, Rinus Aarts, Peter

More information

2. AN INTROSPECTION OF THE MORPHING PROCESS

2. AN INTROSPECTION OF THE MORPHING PROCESS 1. INTRODUCTION Voice morphing means the transition of one speech signal into another. Like image morphing, speech morphing aims to preserve the shared characteristics of the starting and final signals,

More information

Measurement of overtone frequencies of a toy piano and perception of its pitch

Measurement of overtone frequencies of a toy piano and perception of its pitch Measurement of overtone frequencies of a toy piano and perception of its pitch PACS: 43.75.Mn ABSTRACT Akira Nishimura Department of Media and Cultural Studies, Tokyo University of Information Sciences,

More information

in the Howard County Public School System and Rocketship Education

in the Howard County Public School System and Rocketship Education Technical Appendix May 2016 DREAMBOX LEARNING ACHIEVEMENT GROWTH in the Howard County Public School System and Rocketship Education Abstract In this technical appendix, we present analyses of the relationship

More information

A repetition-based framework for lyric alignment in popular songs

A repetition-based framework for lyric alignment in popular songs A repetition-based framework for lyric alignment in popular songs ABSTRACT LUONG Minh Thang and KAN Min Yen Department of Computer Science, School of Computing, National University of Singapore We examine

More information

Music Emotion Recognition. Jaesung Lee. Chung-Ang University

Music Emotion Recognition. Jaesung Lee. Chung-Ang University Music Emotion Recognition Jaesung Lee Chung-Ang University Introduction Searching Music in Music Information Retrieval Some information about target music is available Query by Text: Title, Artist, or

More information

Hidden Markov Model based dance recognition

Hidden Markov Model based dance recognition Hidden Markov Model based dance recognition Dragutin Hrenek, Nenad Mikša, Robert Perica, Pavle Prentašić and Boris Trubić University of Zagreb, Faculty of Electrical Engineering and Computing Unska 3,

More information

Retrieval of textual song lyrics from sung inputs

Retrieval of textual song lyrics from sung inputs INTERSPEECH 2016 September 8 12, 2016, San Francisco, USA Retrieval of textual song lyrics from sung inputs Anna M. Kruspe Fraunhofer IDMT, Ilmenau, Germany kpe@idmt.fraunhofer.de Abstract Retrieving the

More information

Automatic Laughter Detection

Automatic Laughter Detection Automatic Laughter Detection Mary Knox Final Project (EECS 94) knoxm@eecs.berkeley.edu December 1, 006 1 Introduction Laughter is a powerful cue in communication. It communicates to listeners the emotional

More information

Sarcasm Detection in Text: Design Document

Sarcasm Detection in Text: Design Document CSC 59866 Senior Design Project Specification Professor Jie Wei Wednesday, November 23, 2016 Sarcasm Detection in Text: Design Document Jesse Feinman, James Kasakyan, Jeff Stolzenberg 1 Table of contents

More information

Acoustic and musical foundations of the speech/song illusion

Acoustic and musical foundations of the speech/song illusion Acoustic and musical foundations of the speech/song illusion Adam Tierney, *1 Aniruddh Patel #2, Mara Breen^3 * Department of Psychological Sciences, Birkbeck, University of London, United Kingdom # Department

More information

Feature-Based Analysis of Haydn String Quartets

Feature-Based Analysis of Haydn String Quartets Feature-Based Analysis of Haydn String Quartets Lawson Wong 5/5/2 Introduction When listening to multi-movement works, amateur listeners have almost certainly asked the following situation : Am I still

More information

Topics in Computer Music Instrument Identification. Ioanna Karydi

Topics in Computer Music Instrument Identification. Ioanna Karydi Topics in Computer Music Instrument Identification Ioanna Karydi Presentation overview What is instrument identification? Sound attributes & Timbre Human performance The ideal algorithm Selected approaches

More information

Computational Modelling of Harmony

Computational Modelling of Harmony Computational Modelling of Harmony Simon Dixon Centre for Digital Music, Queen Mary University of London, Mile End Rd, London E1 4NS, UK simon.dixon@elec.qmul.ac.uk http://www.elec.qmul.ac.uk/people/simond

More information

IMPROVING SIGNAL DETECTION IN SOFTWARE-BASED FACIAL EXPRESSION ANALYSIS

IMPROVING SIGNAL DETECTION IN SOFTWARE-BASED FACIAL EXPRESSION ANALYSIS WORKING PAPER SERIES IMPROVING SIGNAL DETECTION IN SOFTWARE-BASED FACIAL EXPRESSION ANALYSIS Matthias Unfried, Markus Iwanczok WORKING PAPER /// NO. 1 / 216 Copyright 216 by Matthias Unfried, Markus Iwanczok

More information

Reducing False Positives in Video Shot Detection

Reducing False Positives in Video Shot Detection Reducing False Positives in Video Shot Detection Nithya Manickam Computer Science & Engineering Department Indian Institute of Technology, Bombay Powai, India - 400076 mnitya@cse.iitb.ac.in Sharat Chandran

More information

OBJECTIVE EVALUATION OF A MELODY EXTRACTOR FOR NORTH INDIAN CLASSICAL VOCAL PERFORMANCES

OBJECTIVE EVALUATION OF A MELODY EXTRACTOR FOR NORTH INDIAN CLASSICAL VOCAL PERFORMANCES OBJECTIVE EVALUATION OF A MELODY EXTRACTOR FOR NORTH INDIAN CLASSICAL VOCAL PERFORMANCES Vishweshwara Rao and Preeti Rao Digital Audio Processing Lab, Electrical Engineering Department, IIT-Bombay, Powai,

More information

Automatic Speech Recognition (CS753)

Automatic Speech Recognition (CS753) Automatic Speech Recognition (CS753) Lecture 22: Conversational Agents Instructor: Preethi Jyothi Oct 26, 2017 (All images were reproduced from JM, chapters 29,30) Chatbots Rule-based chatbots Historical

More information

Music Performance Panel: NICI / MMM Position Statement

Music Performance Panel: NICI / MMM Position Statement Music Performance Panel: NICI / MMM Position Statement Peter Desain, Henkjan Honing and Renee Timmers Music, Mind, Machine Group NICI, University of Nijmegen mmm@nici.kun.nl, www.nici.kun.nl/mmm In this

More information

Detecting Musical Key with Supervised Learning

Detecting Musical Key with Supervised Learning Detecting Musical Key with Supervised Learning Robert Mahieu Department of Electrical Engineering Stanford University rmahieu@stanford.edu Abstract This paper proposes and tests performance of two different

More information

Phone-based Plosive Detection

Phone-based Plosive Detection Phone-based Plosive Detection 1 Andreas Madsack, Grzegorz Dogil, Stefan Uhlich, Yugu Zeng and Bin Yang Abstract We compare two segmentation approaches to plosive detection: One aproach is using a uniform

More information

Speaking in Minor and Major Keys

Speaking in Minor and Major Keys Chapter 5 Speaking in Minor and Major Keys 5.1. Introduction 28 The prosodic phenomena discussed in the foregoing chapters were all instances of linguistic prosody. Prosody, however, also involves extra-linguistic

More information

SHORT TERM PITCH MEMORY IN WESTERN vs. OTHER EQUAL TEMPERAMENT TUNING SYSTEMS

SHORT TERM PITCH MEMORY IN WESTERN vs. OTHER EQUAL TEMPERAMENT TUNING SYSTEMS SHORT TERM PITCH MEMORY IN WESTERN vs. OTHER EQUAL TEMPERAMENT TUNING SYSTEMS Areti Andreopoulou Music and Audio Research Laboratory New York University, New York, USA aa1510@nyu.edu Morwaread Farbood

More information

First Stage of an Automated Content-Based Citation Analysis Study: Detection of Citation Sentences 1

First Stage of an Automated Content-Based Citation Analysis Study: Detection of Citation Sentences 1 First Stage of an Automated Content-Based Citation Analysis Study: Detection of Citation Sentences 1 Zehra Taşkın *, Umut Al * and Umut Sezen ** * {ztaskin; umutal}@hacettepe.edu.tr Department of Information

More information

WHAT'S HOT: LINEAR POPULARITY PREDICTION FROM TV AND SOCIAL USAGE DATA Jan Neumann, Xiaodong Yu, and Mohamad Ali Torkamani Comcast Labs

WHAT'S HOT: LINEAR POPULARITY PREDICTION FROM TV AND SOCIAL USAGE DATA Jan Neumann, Xiaodong Yu, and Mohamad Ali Torkamani Comcast Labs WHAT'S HOT: LINEAR POPULARITY PREDICTION FROM TV AND SOCIAL USAGE DATA Jan Neumann, Xiaodong Yu, and Mohamad Ali Torkamani Comcast Labs Abstract Large numbers of TV channels are available to TV consumers

More information

Composer Style Attribution

Composer Style Attribution Composer Style Attribution Jacqueline Speiser, Vishesh Gupta Introduction Josquin des Prez (1450 1521) is one of the most famous composers of the Renaissance. Despite his fame, there exists a significant

More information

Improving Piano Sight-Reading Skills of College Student. Chian yi Ang. Penn State University

Improving Piano Sight-Reading Skills of College Student. Chian yi Ang. Penn State University Improving Piano Sight-Reading Skill of College Student 1 Improving Piano Sight-Reading Skills of College Student Chian yi Ang Penn State University 1 I grant The Pennsylvania State University the nonexclusive

More information

Automatic Laughter Detection

Automatic Laughter Detection Automatic Laughter Detection Mary Knox 1803707 knoxm@eecs.berkeley.edu December 1, 006 Abstract We built a system to automatically detect laughter from acoustic features of audio. To implement the system,

More information

Skip Length and Inter-Starvation Distance as a Combined Metric to Assess the Quality of Transmitted Video

Skip Length and Inter-Starvation Distance as a Combined Metric to Assess the Quality of Transmitted Video Skip Length and Inter-Starvation Distance as a Combined Metric to Assess the Quality of Transmitted Video Mohamed Hassan, Taha Landolsi, Husameldin Mukhtar, and Tamer Shanableh College of Engineering American

More information

AUTOMATIC MAPPING OF SCANNED SHEET MUSIC TO AUDIO RECORDINGS

AUTOMATIC MAPPING OF SCANNED SHEET MUSIC TO AUDIO RECORDINGS AUTOMATIC MAPPING OF SCANNED SHEET MUSIC TO AUDIO RECORDINGS Christian Fremerey, Meinard Müller,Frank Kurth, Michael Clausen Computer Science III University of Bonn Bonn, Germany Max-Planck-Institut (MPI)

More information

Musical Signal Processing with LabVIEW Introduction to Audio and Musical Signals. By: Ed Doering

Musical Signal Processing with LabVIEW Introduction to Audio and Musical Signals. By: Ed Doering Musical Signal Processing with LabVIEW Introduction to Audio and Musical Signals By: Ed Doering Musical Signal Processing with LabVIEW Introduction to Audio and Musical Signals By: Ed Doering Online:

More information

Temporal coordination in string quartet performance

Temporal coordination in string quartet performance International Symposium on Performance Science ISBN 978-2-9601378-0-4 The Author 2013, Published by the AEC All rights reserved Temporal coordination in string quartet performance Renee Timmers 1, Satoshi

More information

Analysis of the effects of signal distance on spectrograms

Analysis of the effects of signal distance on spectrograms 2014 Analysis of the effects of signal distance on spectrograms SGHA 8/19/2014 Contents Introduction... 3 Scope... 3 Data Comparisons... 5 Results... 10 Recommendations... 10 References... 11 Introduction

More information

Figures in Scientific Open Access Publications

Figures in Scientific Open Access Publications Figures in Scientific Open Access Publications Lucia Sohmen 2[0000 0002 2593 8754], Jean Charbonnier 1[0000 0001 6489 7687], Ina Blümel 1,2[0000 0002 3075 7640], Christian Wartena 1[0000 0001 5483 1529],

More information

Drum Sound Identification for Polyphonic Music Using Template Adaptation and Matching Methods

Drum Sound Identification for Polyphonic Music Using Template Adaptation and Matching Methods Drum Sound Identification for Polyphonic Music Using Template Adaptation and Matching Methods Kazuyoshi Yoshii, Masataka Goto and Hiroshi G. Okuno Department of Intelligence Science and Technology National

More information

Pitfalls and Windfalls in Corpus Studies of Pop/Rock Music

Pitfalls and Windfalls in Corpus Studies of Pop/Rock Music Introduction Hello, my talk today is about corpus studies of pop/rock music specifically, the benefits or windfalls of this type of work as well as some of the problems. I call these problems pitfalls

More information

EFFECT OF REPETITION OF STANDARD AND COMPARISON TONES ON RECOGNITION MEMORY FOR PITCH '

EFFECT OF REPETITION OF STANDARD AND COMPARISON TONES ON RECOGNITION MEMORY FOR PITCH ' Journal oj Experimental Psychology 1972, Vol. 93, No. 1, 156-162 EFFECT OF REPETITION OF STANDARD AND COMPARISON TONES ON RECOGNITION MEMORY FOR PITCH ' DIANA DEUTSCH " Center for Human Information Processing,

More information

ONLINE ACTIVITIES FOR MUSIC INFORMATION AND ACOUSTICS EDUCATION AND PSYCHOACOUSTIC DATA COLLECTION

ONLINE ACTIVITIES FOR MUSIC INFORMATION AND ACOUSTICS EDUCATION AND PSYCHOACOUSTIC DATA COLLECTION ONLINE ACTIVITIES FOR MUSIC INFORMATION AND ACOUSTICS EDUCATION AND PSYCHOACOUSTIC DATA COLLECTION Travis M. Doll Ray V. Migneco Youngmoo E. Kim Drexel University, Electrical & Computer Engineering {tmd47,rm443,ykim}@drexel.edu

More information

CSC475 Music Information Retrieval

CSC475 Music Information Retrieval CSC475 Music Information Retrieval Monophonic pitch extraction George Tzanetakis University of Victoria 2014 G. Tzanetakis 1 / 32 Table of Contents I 1 Motivation and Terminology 2 Psychacoustics 3 F0

More information

Available online at ScienceDirect. Procedia Computer Science 46 (2015 )

Available online at  ScienceDirect. Procedia Computer Science 46 (2015 ) Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 46 (2015 ) 381 387 International Conference on Information and Communication Technologies (ICICT 2014) Music Information

More information

BIBLIOMETRIC REPORT. Bibliometric analysis of Mälardalen University. Final Report - updated. April 28 th, 2014

BIBLIOMETRIC REPORT. Bibliometric analysis of Mälardalen University. Final Report - updated. April 28 th, 2014 BIBLIOMETRIC REPORT Bibliometric analysis of Mälardalen University Final Report - updated April 28 th, 2014 Bibliometric analysis of Mälardalen University Report for Mälardalen University Per Nyström PhD,

More information

Subjective evaluation of common singing skills using the rank ordering method

Subjective evaluation of common singing skills using the rank ordering method lma Mater Studiorum University of ologna, ugust 22-26 2006 Subjective evaluation of common singing skills using the rank ordering method Tomoyasu Nakano Graduate School of Library, Information and Media

More information

Melody classification using patterns

Melody classification using patterns Melody classification using patterns Darrell Conklin Department of Computing City University London United Kingdom conklin@city.ac.uk Abstract. A new method for symbolic music classification is proposed,

More information

SOUND LABORATORY LING123: SOUND AND COMMUNICATION

SOUND LABORATORY LING123: SOUND AND COMMUNICATION SOUND LABORATORY LING123: SOUND AND COMMUNICATION In this assignment you will be using the Praat program to analyze two recordings: (1) the advertisement call of the North American bullfrog; and (2) the

More information

Where to present your results. V4 Seminars for Young Scientists on Publishing Techniques in the Field of Engineering Science

Where to present your results. V4 Seminars for Young Scientists on Publishing Techniques in the Field of Engineering Science Visegrad Grant No. 21730020 http://vinmes.eu/ V4 Seminars for Young Scientists on Publishing Techniques in the Field of Engineering Science Where to present your results Dr. Balázs Illés Budapest University

More information

Modeling memory for melodies

Modeling memory for melodies Modeling memory for melodies Daniel Müllensiefen 1 and Christian Hennig 2 1 Musikwissenschaftliches Institut, Universität Hamburg, 20354 Hamburg, Germany 2 Department of Statistical Science, University

More information

VISUAL CONTENT BASED SEGMENTATION OF TALK & GAME SHOWS. O. Javed, S. Khan, Z. Rasheed, M.Shah. {ojaved, khan, zrasheed,

VISUAL CONTENT BASED SEGMENTATION OF TALK & GAME SHOWS. O. Javed, S. Khan, Z. Rasheed, M.Shah. {ojaved, khan, zrasheed, VISUAL CONTENT BASED SEGMENTATION OF TALK & GAME SHOWS O. Javed, S. Khan, Z. Rasheed, M.Shah {ojaved, khan, zrasheed, shah}@cs.ucf.edu Computer Vision Lab School of Electrical Engineering and Computer

More information

Modeling sound quality from psychoacoustic measures

Modeling sound quality from psychoacoustic measures Modeling sound quality from psychoacoustic measures Lena SCHELL-MAJOOR 1 ; Jan RENNIES 2 ; Stephan D. EWERT 3 ; Birger KOLLMEIER 4 1,2,4 Fraunhofer IDMT, Hör-, Sprach- und Audiotechnologie & Cluster of

More information

Acoustic Echo Canceling: Echo Equality Index

Acoustic Echo Canceling: Echo Equality Index Acoustic Echo Canceling: Echo Equality Index Mengran Du, University of Maryalnd Dr. Bogdan Kosanovic, Texas Instruments Industry Sponsored Projects In Research and Engineering (INSPIRE) Maryland Engineering

More information

1 Introduction to PSQM

1 Introduction to PSQM A Technical White Paper on Sage s PSQM Test Renshou Dai August 7, 2000 1 Introduction to PSQM 1.1 What is PSQM test? PSQM stands for Perceptual Speech Quality Measure. It is an ITU-T P.861 [1] recommended

More information

A SEMANTIC DIFFERENTIAL STUDY OF LOW AMPLITUDE SUPERSONIC AIRCRAFT NOISE AND OTHER TRANSIENT SOUNDS

A SEMANTIC DIFFERENTIAL STUDY OF LOW AMPLITUDE SUPERSONIC AIRCRAFT NOISE AND OTHER TRANSIENT SOUNDS 19 th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, 2-7 SEPTEMBER 2007 A SEMANTIC DIFFERENTIAL STUDY OF LOW AMPLITUDE SUPERSONIC AIRCRAFT NOISE AND OTHER TRANSIENT SOUNDS PACS: 43.28.Mw Marshall, Andrew

More information

Composer Identification of Digital Audio Modeling Content Specific Features Through Markov Models

Composer Identification of Digital Audio Modeling Content Specific Features Through Markov Models Composer Identification of Digital Audio Modeling Content Specific Features Through Markov Models Aric Bartle (abartle@stanford.edu) December 14, 2012 1 Background The field of composer recognition has

More information

Music Segmentation Using Markov Chain Methods

Music Segmentation Using Markov Chain Methods Music Segmentation Using Markov Chain Methods Paul Finkelstein March 8, 2011 Abstract This paper will present just how far the use of Markov Chains has spread in the 21 st century. We will explain some

More information

THE DIGITAL DELAY ADVANTAGE A guide to using Digital Delays. Synchronize loudspeakers Eliminate comb filter distortion Align acoustic image.

THE DIGITAL DELAY ADVANTAGE A guide to using Digital Delays. Synchronize loudspeakers Eliminate comb filter distortion Align acoustic image. THE DIGITAL DELAY ADVANTAGE A guide to using Digital Delays Synchronize loudspeakers Eliminate comb filter distortion Align acoustic image Contents THE DIGITAL DELAY ADVANTAGE...1 - Why Digital Delays?...

More information

Efficient Computer-Aided Pitch Track and Note Estimation for Scientific Applications. Matthias Mauch Chris Cannam György Fazekas

Efficient Computer-Aided Pitch Track and Note Estimation for Scientific Applications. Matthias Mauch Chris Cannam György Fazekas Efficient Computer-Aided Pitch Track and Note Estimation for Scientific Applications Matthias Mauch Chris Cannam György Fazekas! 1 Matthias Mauch, Chris Cannam, George Fazekas Problem Intonation in Unaccompanied

More information

White Paper Measuring and Optimizing Sound Systems: An introduction to JBL Smaart

White Paper Measuring and Optimizing Sound Systems: An introduction to JBL Smaart White Paper Measuring and Optimizing Sound Systems: An introduction to JBL Smaart by Sam Berkow & Alexander Yuill-Thornton II JBL Smaart is a general purpose acoustic measurement and sound system optimization

More information

Pre-Processing of ERP Data. Peter J. Molfese, Ph.D. Yale University

Pre-Processing of ERP Data. Peter J. Molfese, Ph.D. Yale University Pre-Processing of ERP Data Peter J. Molfese, Ph.D. Yale University Before Statistical Analyses, Pre-Process the ERP data Planning Analyses Waveform Tools Types of Tools Filter Segmentation Visual Review

More information

THE EFFECT OF EXPERTISE IN EVALUATING EMOTIONS IN MUSIC

THE EFFECT OF EXPERTISE IN EVALUATING EMOTIONS IN MUSIC THE EFFECT OF EXPERTISE IN EVALUATING EMOTIONS IN MUSIC Fabio Morreale, Raul Masu, Antonella De Angeli, Patrizio Fava Department of Information Engineering and Computer Science, University Of Trento, Italy

More information

Tutorial 0: Uncertainty in Power and Sample Size Estimation. Acknowledgements:

Tutorial 0: Uncertainty in Power and Sample Size Estimation. Acknowledgements: Tutorial 0: Uncertainty in Power and Sample Size Estimation Anna E. Barón, Keith E. Muller, Sarah M. Kreidler, and Deborah H. Glueck Acknowledgements: The project was supported in large part by the National

More information

... A Pseudo-Statistical Approach to Commercial Boundary Detection. Prasanna V Rangarajan Dept of Electrical Engineering Columbia University

... A Pseudo-Statistical Approach to Commercial Boundary Detection. Prasanna V Rangarajan Dept of Electrical Engineering Columbia University A Pseudo-Statistical Approach to Commercial Boundary Detection........ Prasanna V Rangarajan Dept of Electrical Engineering Columbia University pvr2001@columbia.edu 1. Introduction Searching and browsing

More information

Recommending Music for Language Learning: The Problem of Singing Voice Intelligibility

Recommending Music for Language Learning: The Problem of Singing Voice Intelligibility Recommending Music for Language Learning: The Problem of Singing Voice Intelligibility Karim M. Ibrahim (M.Sc.,Nile University, Cairo, 2016) A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF SCIENCE DEPARTMENT

More information

AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY

AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY Eugene Mikyung Kim Department of Music Technology, Korea National University of Arts eugene@u.northwestern.edu ABSTRACT

More information

Blueline, Linefree, Accuracy Ratio, & Moving Absolute Mean Ratio Charts

Blueline, Linefree, Accuracy Ratio, & Moving Absolute Mean Ratio Charts INTRODUCTION This instruction manual describes for users of the Excel Standard Celeration Template(s) the features of each page or worksheet in the template, allowing the user to set up and generate charts

More information

DETECTION OF SLOW-MOTION REPLAY SEGMENTS IN SPORTS VIDEO FOR HIGHLIGHTS GENERATION

DETECTION OF SLOW-MOTION REPLAY SEGMENTS IN SPORTS VIDEO FOR HIGHLIGHTS GENERATION DETECTION OF SLOW-MOTION REPLAY SEGMENTS IN SPORTS VIDEO FOR HIGHLIGHTS GENERATION H. Pan P. van Beek M. I. Sezan Electrical & Computer Engineering University of Illinois Urbana, IL 6182 Sharp Laboratories

More information

Machine Learning Term Project Write-up Creating Models of Performers of Chopin Mazurkas

Machine Learning Term Project Write-up Creating Models of Performers of Chopin Mazurkas Machine Learning Term Project Write-up Creating Models of Performers of Chopin Mazurkas Marcello Herreshoff In collaboration with Craig Sapp (craig@ccrma.stanford.edu) 1 Motivation We want to generative

More information

Computer-based sound spectrograph system

Computer-based sound spectrograph system Computer-based sound spectrograph system William J. Strong and E. Paul Palmer Department of Physics and Astronomy, Brigham Young University, Provo, Utah 84602 (Received 8 January 1975; revised 17 June

More information

homework solutions for: Homework #4: Signal-to-Noise Ratio Estimation submitted to: Dr. Joseph Picone ECE 8993 Fundamentals of Speech Recognition

homework solutions for: Homework #4: Signal-to-Noise Ratio Estimation submitted to: Dr. Joseph Picone ECE 8993 Fundamentals of Speech Recognition INSTITUTE FOR SIGNAL AND INFORMATION PROCESSING homework solutions for: Homework #4: Signal-to-Noise Ratio Estimation submitted to: Dr. Joseph Picone ECE 8993 Fundamentals of Speech Recognition May 3,

More information

Evaluation of Automatic Shot Boundary Detection on a Large Video Test Suite

Evaluation of Automatic Shot Boundary Detection on a Large Video Test Suite Evaluation of Automatic Shot Boundary Detection on a Large Video Test Suite Colin O Toole 1, Alan Smeaton 1, Noel Murphy 2 and Sean Marlow 2 School of Computer Applications 1 & School of Electronic Engineering

More information

EXPLORING THE USE OF ENF FOR MULTIMEDIA SYNCHRONIZATION

EXPLORING THE USE OF ENF FOR MULTIMEDIA SYNCHRONIZATION EXPLORING THE USE OF ENF FOR MULTIMEDIA SYNCHRONIZATION Hui Su, Adi Hajj-Ahmad, Min Wu, and Douglas W. Oard {hsu, adiha, minwu, oard}@umd.edu University of Maryland, College Park ABSTRACT The electric

More information

A Matlab toolbox for. Characterisation Of Recorded Underwater Sound (CHORUS) USER S GUIDE

A Matlab toolbox for. Characterisation Of Recorded Underwater Sound (CHORUS) USER S GUIDE Centre for Marine Science and Technology A Matlab toolbox for Characterisation Of Recorded Underwater Sound (CHORUS) USER S GUIDE Version 5.0b Prepared for: Centre for Marine Science and Technology Prepared

More information

AUTOMATIC ACCOMPANIMENT OF VOCAL MELODIES IN THE CONTEXT OF POPULAR MUSIC

AUTOMATIC ACCOMPANIMENT OF VOCAL MELODIES IN THE CONTEXT OF POPULAR MUSIC AUTOMATIC ACCOMPANIMENT OF VOCAL MELODIES IN THE CONTEXT OF POPULAR MUSIC A Thesis Presented to The Academic Faculty by Xiang Cao In Partial Fulfillment of the Requirements for the Degree Master of Science

More information