Predicting when to Laugh with Structured Classification

INTERSPEECH 2014, 14-18 September 2014, Singapore

Bilal Piot, Olivier Pietquin, Matthieu Geist
SUPELEC, IMS-MaLIS research group and UMI 2958 (GeorgiaTech - CNRS)
University Lille 1, LIFL (UMR 8022 CNRS/Lille 1) - SequeL team
firstname.lastname@supelec.fr, olivier.pietquin@univ-lille1.fr

Abstract

Today, Embodied Conversational Agents (ECAs) are emerging as natural media to interact with machines. Applications are numerous and ECAs can reduce the technological gap between people by providing user-friendly interfaces. Yet, ECAs are still unable to produce social signals appropriately during their interaction with humans, which tends to make the interaction less instinctive. In particular, very little attention has been paid to the use of laughter in human-avatar interactions, despite the crucial role played by laughter in human-human interaction. In this paper, a method for predicting the most appropriate moments for an ECA to laugh is proposed. Imitation learning via a structured classification algorithm is used for this purpose and is shown to produce behavior similar to that of humans in a practical application: the yes/no game.

Index Terms: Imitation Learning, Structured Classification

1. Introduction

Building efficient and user-friendly human-machine interfaces is a key challenge for the future of computer science, enabling a large public to interact with complex systems and reducing the technological gap between people. In the last decade, Embodied Conversational Agents (ECAs) emerged as such interfaces. Yet, their behaviour is still perceived as quite unnatural by users. One reason for this bad perception is the inability of ECAs to make proper use of social signals, although there exists some research on this topic [1]. Among these signals, laughter is a prominent feature used by humans during interactions. Yet, very little attention had been paid to endowing ECAs with laughter capabilities until recently [2]. Enabling ECAs with laughter capabilities is not only about being able to synthesize audio-visual laughter signals [3, 4]. It also concerns an appropriate management of laughter during the interaction. There is thus a need for a laughter-enabled interaction manager, able to decide when to laugh so that it is appropriate in the conversation. This being said, it remains difficult to define what an appropriate moment to laugh is.

More formally, the task of the laughter-enabled Interaction Manager (IM) is to take decisions about whether to laugh or not. These decisions have to be taken according to the interaction context, which can be inferred from the laughter, speech and smile detection modules (detecting social signals emitted by the users) implemented in the ECA, but also from the task context (for example, if the human is playing a game with the ECA, what is the status of the game). Formally, the IM is thus a module implementing a mapping between contexts (or states, noted s ∈ S) and decisions (or actions, noted a ∈ A). Let us call this mapping a policy, noted π(s) = a. This mapping is quite difficult to learn from real data as laughs are quite rare and very different from one user to another. In this paper, we describe research results on learning such a mapping from data recorded during human-human interactions, so as to implement in the IM a behavior similar to that of a human. An imitation learning method is thus adopted.
In particular, structured classification is investigated and shown to efficiently learn a behavior similar to that of human users, where the similarity between humans and algorithms is measured via a new criterion called Naturalness, defined in Sec. 5. In addition, we use a boosting technique for the structured classification algorithm, which makes it non-parametric. This avoids the choice of meta-parameters. Finally, we test different imitation algorithms on data sets of real laughs in a natural interaction context, the yes/no game described in Sec. 4.

2. Imitation Learning

Describing the optimal behavior of the avatar is a very tricky task. It would require perfect knowledge of the rules governing the generation of laughter by humans. Interpreting sources of laughter or predicting laughter from a cognitive or psychological perspective is non-trivial. Therefore, a data-driven method has been preferred here. In particular, learning by imitation seems the best suited framework to learn the IM policy. Indeed, humans implement such a policy and they can provide examples of natural behaviors. Formally, in the learning by imitation framework, an artificial learning agent (here the IM) learns to behave optimally by observing some expert agent demonstrating the task. The expert implements an optimal policy noted π_E and the demonstrations provide a set of examples {s_i, a_i = π_E(s_i)}_{1 ≤ i ≤ N}. The problem is thus to learn, from the set of demonstrations, a policy π̂ such that ∀s, π̂(s) ≈ π_E(s).

One way to address the problem of imitation learning is to reduce it to a Multi-Class Classification (MCC) problem [5, 6, 7, 8]. The goal of MCC is, given a training set D = (s_i ∈ S, a_i ∈ A)_{1 ≤ i ≤ N}, where S is a compact set of inputs (generally a compact subset of R^n) and A a finite set of labels, to find a decision rule π ∈ A^S that generalizes the relation between inputs and labels. More formally, it consists in finding a decision rule π ∈ H, where H ⊂ A^S is called the hypothesis space, that tries to minimize the following empirical risk:

    T(\pi) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}_{\{\pi(s_i) \neq a_i\}},    (1)

where \mathbb{1}_{\{a \neq b\}} = 1 if a ≠ b and 0 otherwise. A large literature already exists on the MCC problem. Well-known methods such as Classification Trees [9], K-Nearest Neighbors (KNN) [10] and Support Vector Machines (SVM) [11, 12] are widely used and statistically studied.
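To make the reduction concrete, here is a minimal sketch (not the authors' code) of imitation learning as plain multi-class classification with scikit-learn; the 9-dimensional states, binary laugh/no-laugh actions and rough class balance follow Sec. 4, while the data and all hyper-parameter values are illustrative placeholders.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Expert demonstrations {(s_i, a_i = pi_E(s_i))}: states are feature
# vectors in [0, 1]^9, actions are 0 (no laugh) or 1 (laugh).
rng = np.random.default_rng(0)
S = rng.random((2185, 9))                    # placeholder states
a = (rng.random(2185) < 1 / 6).astype(int)   # placeholder expert actions

# Reducing imitation to MCC: fit a decision rule pi that minimizes the
# empirical risk of Eq. (1), then imitate the expert with pi(s).
classifiers = {
    "KNN":  KNeighborsClassifier(n_neighbors=1),
    "Tree": DecisionTreeClassifier(),
    "SVM":  SVC(kernel="rbf"),
}
for name, clf in classifiers.items():
    pi = clf.fit(S, a)
    print(name, (pi.predict(S) == a).mean())  # empirical agreement with pi_E
```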

In [5], the authors use an artificial neural network to learn a driving policy for a robotic vehicle. Neural nets are also used in [7] to learn to play video games (although the method is more generic and could use other MCC methods). KNNs were used in [2] in an application similar to the one described in this paper. In [6], structured classification [13] is used to learn a grasping control policy for a robotic arm.

3. Structured Classification for Imitation Learning

In [6], the authors use a large margin approach which allows adding some prior (or structure) via a margin function in the classification method. That is why it is considered a structured classification method. The large margin approach is a score-based MCC where the decision rule π ∈ A^S is obtained via a score function q ∈ R^{S×A} such that ∀s ∈ S, π(s) ∈ argmax_{a ∈ A} q(s, a). The large margin approach consists, given the training set D, in solving the following optimization problem:

    q^{*} = \operatorname{argmin}_{q \in \mathbb{R}^{S \times A}} J(q),    (2)

    J(q) = \sum_{i=1}^{N} \left[ \max_{a \in A} \{ q(s_i, a) + l(s_i, a_i, a) \} - q(s_i, a_i) \right],    (3)

where l ∈ R_+^{S×A×A} is called the margin function. If it is zero, minimizing J(q) attempts to find a score function q for which the example labels are scored higher than all other labels. Choosing a nonzero margin function improves generalization [6]. Instead of requiring only that the demonstrated label is scored higher than all other labels, it requires it to be better than each label a by an amount given by the margin function. Thus, the margin function allows deciding which samples are required to be well classified, by putting an important margin on a particular example compared to the others. The policy output by this algorithm is π(s) ∈ argmax_{a ∈ A} q̂(s, a), where q̂ is the result of the minimization of J(q). The advantages of this method are its simplicity and the possibility of changing the margin, which allows us to adapt to specific characteristics of the problem.

In addition, in [14], the authors use a boosting technique to solve the optimization problem given by Eq. (2), which is advantageous. A boosting method is an interesting optimization technique: it minimizes the criterion in Eq. (3) directly, without a feature-selection step. As presented in [15], a boosting algorithm is a projected sub-gradient descent [16] of a convex functional (here J is convex with respect to the variable q) in a specific function space (here R^{S×A}), which has to be a Hilbert space. Boosting algorithms use a projection step onto a restricted set of functions when optimizing over the function space, because the functions representing the gradient are often computationally difficult to manipulate and do not generalize well to new inputs [15]. In the boosting literature, the restricted set corresponds directly to the set of hypotheses generated by a weak learner. In our experiments, we choose as restricted set the set of classification trees with two classes.
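The following is a minimal sketch of this boosted large-margin scheme under stated assumptions: binary actions, shallow regression trees standing in for the two-class classification trees of the weak-learner set, and an arbitrary user-supplied margin function l(s_i, a_i, a). It is a reconstruction from Eqs. (2)-(3) and [14, 15], not the authors' implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boosted_large_margin(S, a, margin, n_actions=2, rounds=100, eta=0.1):
    """Minimize J(q) = sum_i [max_a' {q(s_i,a') + l(s_i,a_i,a')} - q(s_i,a_i)]
    by projected functional sub-gradient descent; returns the greedy
    policy pi(s) = argmax_a q(s, a)."""
    S, a = np.asarray(S, float), np.asarray(a, int)
    N = len(a)
    # pre-computed margins l(s_i, a_i, a) for every alternative action a
    L = np.array([[margin(ai, b) for b in range(n_actions)] for ai in a])
    ensemble = []  # one weak learner per action per boosting round

    def q(X):
        scores = np.zeros((len(X), n_actions))
        for round_trees in ensemble:
            for act, tree in enumerate(round_trees):
                scores[:, act] += eta * tree.predict(X)
        return scores

    for _ in range(rounds):
        # loss-augmented decision: a*_i = argmax_a q(s_i, a) + l(s_i, a_i, a)
        a_star = (q(S) + L).argmax(axis=1)
        # functional sub-gradient of J at q: +1 on a*_i, -1 on a_i
        g = np.zeros((N, n_actions))
        g[np.arange(N), a_star] += 1.0
        g[np.arange(N), a] -= 1.0
        # projection step: approximate -g inside the weak-learner set
        ensemble.append([DecisionTreeRegressor(max_depth=3).fit(S, -g[:, b])
                         for b in range(n_actions)])

    return lambda X: q(np.asarray(X, float)).argmax(axis=1)
```

With a zero margin this degenerates to a boosted multi-class hinge loss; the structured margin of Sec. 5 is supplied through the `margin` argument.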
4. Experimental Setup

The yes/no game is one of the possible scenarios of an interaction between humans and avatars where laughter is involved. In this game, players must respond to questions without saying "yes" or "no". The experiment scenario we present in this article is illustrated in Fig. 1. Two users are sitting on one side of a table while a virtual agent projected on a large screen is placed on the opposite side of the table. The users start to play the yes/no game, one asking questions (e.g., "what's your nationality?", "are you sure?"); this user is named U1. The other one answers, trying to avoid saying "yes" or "no" (e.g., "I'm not sure" or "definitely"); this user is named U2.

Figure 1: Experimental setup.

The avatar, named A, participates in the interaction by laughing and asking questions. Of course, U1 and A try to make U2 say "yes" or "no" and thus try to induce a loss of self-control. At any point, laughter can occur from any participant. The avatar has to generate laughter at appropriate moments given its perception of the context. As shown in Fig. 1, detection of the humans' laughter is performed through body (Kinect and body markers), face (Kinect) and speech (head-mounted microphones) analysis [17]. Several recognition algorithms are executed in real time to determine the users' expressivity of motion.

In order to train our avatar with an imitation learning algorithm, several experiments are first recorded in which the avatar (symbolized by a screen in Fig. 1) is replaced by a human playing the role of the avatar (this is the expert we want to imitate). The same detection material as for the two other participants is used for the human playing the role of the avatar. From those recordings an expert data set D = {s_i, a_i = π_E(s_i)}_{1 ≤ i ≤ N} is generated, which is the input of an imitation learning algorithm. Indeed, for each user (U1 and U2), the recognition algorithms extract, every 0.5 s, 4 features which are real values between 0 and 1: the probability of speech, the probability of laughter, the intensity of laughter and the probability of smile. Moreover, another feature, which represents the context of the game, is added by annotation of the recordings: 0 when the game is currently ongoing and 1 when it has ended (that is, when U2 said "yes" or "no" or some time-out occurred). Thus, every 0.5 s, we are provided with 9 features (4 features for U1, 4 features for U2 and the context) that represent the state of the game s_i. Finally, by annotation of the recordings, we obtain every 0.5 s a binary label (1 or 0) giving the decision of the expert (a_i): laugh or no laugh (so it is a two-action decision process). A sample (s_i, a_i) where a_i = 0 corresponds to a no-laugh sample and a sample (s_i, a_i) where a_i = 1 corresponds to a laugh sample. In addition, we also collect, by annotation, the binary laugh/no-laugh information for U1 and U2: (a_i^{U1}, a_i^{U2})_{1 ≤ i ≤ N}. Now that we have the expert data set, it is possible to use it as input to different imitation learning algorithms.
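As a small illustration of this state encoding, one 0.5 s frame could be assembled as below; the field names are hypothetical (the actual detection pipeline is the SSI framework of [17]).

```python
import numpy as np

def build_state(u1, u2, game_ended):
    """9-dimensional state s_i for one 0.5 s frame: for each user, the
    probability of speech, probability of laughter, laughter intensity
    and probability of smile (all in [0, 1]), plus the annotated game
    context (0 = game ongoing, 1 = game ended)."""
    def per_user(u):
        return [u["p_speech"], u["p_laugh"], u["laugh_intensity"], u["p_smile"]]
    return np.array(per_user(u1) + per_user(u2) + [float(game_ended)])

# e.g. a frame where U2 is laughing while the game is still running:
s = build_state(
    {"p_speech": 0.1, "p_laugh": 0.2, "laugh_intensity": 0.1, "p_smile": 0.4},
    {"p_speech": 0.0, "p_laugh": 0.9, "laugh_intensity": 0.7, "p_smile": 0.8},
    game_ended=0)
```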

Figure 2: Real demo.

5. Results

In this section, we present the results obtained by applying different imitation learning algorithms to the expert data set. We use 4 different algorithms: 3 classical classification algorithms, namely KNN, Classification Tree and SVM, and the large-margin algorithm presented in Section 3. The KNN algorithm was previously used in [2]; we make the same choice of K here in order to compare with the other methods. The SVM algorithm uses a Gaussian kernel with a fixed standard deviation σ, and the Classification Tree is a pruned binary classification tree. For the large margin approach, we choose a margin with a particular structure that favors the no-laugh samples over the laugh samples, so as to synthesize laughter only when it is really appropriate. Indeed, laughing at inappropriate moments seems awkward to humans and it is important to avoid that, while not laughing is not too problematic in this application. Thus, we choose the following margin structure:

    l(s_i, a_i, a) = 0  if a = a_i,    (4)
    l(s_i, a_i, a) = 6  if a ≠ a_i and a_i = 0 (no laugh),    (5)
    l(s_i, a_i, a) = 1  if a ≠ a_i and a_i = 1 (laugh).    (6)

Eighteen minutes of recordings were collected in three sessions where the game was played several times (at least twice per recording). This provided an expert data set D = {s_i, a_i = π_E(s_i)}_{1 ≤ i ≤ N} of 2185 examples (that is, the number of 0.5 s frames). The 4 algorithms were trained on this data set. In order to compare the performances of the algorithms, we use a P-fold cross-validation. In P-fold cross-validation, the original data set D is partitioned into P equal-size sub-samples D = (D_p)_{1 ≤ p ≤ P}, where D_p = {s_{j,p}, a_{j,p} = π_E(s_{j,p})}_{1 ≤ j ≤ N_p} and N = Σ_{p=1}^{P} N_p. Of the P sub-samples, a single sub-sample is retained as validation data for testing the algorithm, and the remaining P − 1 sub-samples are used as training data. The cross-validation process is then repeated P times (the folds), with each of the P sub-samples used exactly once as validation data. The P results from the folds can then be averaged to produce a single estimate. For each sub-sample D_p and each algorithm algo (where algo can be KNN, SVM, LargeMargin or Tree), we define the policy π_p^{algo} ∈ A^S learned on the remaining sub-samples. In addition, we define, for each sub-sample D_p, the number of laugh samples N_p^{laugh} = Σ_{j=1}^{N_p} 1_{a_{j,p}=1} and the number of no-laugh samples N_p^{nolaugh}. Several quality evaluation criteria were used for each algorithm. The first criterion is the mean over the P folds of the global classification rate:

    R^{global}_{algo} = \frac{1}{P} \sum_{p=1}^{P} \frac{1}{N_p} \sum_{j=1}^{N_p} \mathbb{1}_{\{\pi_p^{algo}(s_{j,p}) = a_{j,p}\}}.    (7)

The second criterion is the mean over the P folds of the classification rate on the laugh samples:

    R^{laugh}_{algo} = \frac{1}{P} \sum_{p=1}^{P} \frac{1}{N_p^{laugh}} \sum_{j=1}^{N_p} \mathbb{1}_{\{\pi_p^{algo}(s_{j,p}) = a_{j,p}\}} \mathbb{1}_{\{a_{j,p} = 1\}}.    (8)

The third criterion is the mean over the P folds of the classification rate on the no-laugh samples:

    R^{nolaugh}_{algo} = \frac{1}{P} \sum_{p=1}^{P} \frac{1}{N_p^{nolaugh}} \sum_{j=1}^{N_p} \mathbb{1}_{\{\pi_p^{algo}(s_{j,p}) = a_{j,p}\}} \mathbb{1}_{\{a_{j,p} = 0\}}.    (9)

We chose these different criteria in order to see the quality of each algorithm on the laugh samples and on the no-laugh samples, because those two classes are not well balanced (basically, there are 5 times more no-laugh samples than laugh samples). Table 1 shows the results of the different algorithms in terms of classification rates, with P = 5.

Algorithms      Global    Laugh     No laugh
Large Margin    0.687     0.356     0.808
SVM             0.673     0.366     0.7893
KNN             0.5440    0.6347    0.573
Tree            0.5570    0.566     0.573

Table 1: Classification rates.

The Large Margin approach has the best results for the global classification rate and the no-laugh rate.
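A minimal sketch of this evaluation protocol (Eqs. (7)-(9)) follows; `fit` stands for any of the four algorithms and returns a policy (e.g. `lambda S, a: boosted_large_margin(S, a, margin)` from the sketch above, with `margin` implementing Eqs. (4)-(6)), and the shuffled K-fold split is an assumption about how the folds were drawn.

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validated_rates(fit, S, a, P=5, seed=0):
    """Mean over the P folds of the global, laugh and no-laugh
    classification rates of the learned policies (Eqs. (7)-(9))."""
    S, a = np.asarray(S, float), np.asarray(a, int)
    rates = []
    for train, test in KFold(n_splits=P, shuffle=True,
                             random_state=seed).split(S):
        pi = fit(S[train], a[train])       # policy learned on P-1 folds
        ok = pi(S[test]) == a[test]        # agreement with the expert
        rates.append([ok.mean(),                 # global rate, Eq. (7)
                      ok[a[test] == 1].mean(),   # laugh rate, Eq. (8)
                      ok[a[test] == 0].mean()])  # no-laugh rate, Eq. (9)
    return np.mean(rates, axis=0)
```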
The structure of the margin favors the performance on the no-laugh samples, and this is reflected in the results. KNN works well on the laugh samples, which is also the case for the Classification Tree, but it has really poor global performance. It seems that the avatar is too reactive (it laughs too often), which can be problematic if the laughs happen at inappropriate moments: this behavior appears unnatural. In order to check whether the good performance on laughs of KNN is due to it being too reactive, we computed the number of laughs produced for each fold and took the mean. Results are provided in Table 2. The Classification Tree avatar and the KNN avatar are too reactive, which can explain their good performance on laughs, but their behavior is not natural compared to the expert's. The most natural behavior is the one produced by the Large Margin algorithm, which laughs in the same proportion as the expert.
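This over-reactivity check amounts to counting the predicted laugh frames per validation fold; a short sketch:

```python
import numpy as np

def mean_laughs_per_fold(fold_predictions):
    """Table 2's measure: the average number of laugh decisions (a = 1)
    an avatar produces per validation fold."""
    return float(np.mean([np.sum(pred == 1) for pred in fold_predictions]))
```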

Algorithms      Average number of laughs
Expert          .6
Large Margin    5.4
SVM             7.4
KNN             35.4
Tree            5.

Table 2: Comparison of laugh counts.

So the classification rates are not appropriate measures to assess the algorithms for this application. For this reason, we came up with a measure of naturalness which indicates whether the policy produced by an algorithm corresponds to the behavior of the expert. The idea is to compare whether, relative to the two other users, the human playing the avatar and the algorithm have the same behavior. In order to see if there is a similarity between the behavior of the user playing the avatar, A_Expert, and the one learnt by the algorithms, A_algo, we check whether the behavior of the expert A_Expert compares to the users (U_q)_{q=1,2} similarly to the way the avatar's behaviour A_algo compares to the users (U_q)_{q=1,2}. The idea is to show that the avatar does not differ more from U1 and U2 than the expert does. To do so, for each user U_q and each sub-sample D_p, we define the number of laugh samples N_p^{U_q,laugh} = Σ_{j=1}^{N_p} 1_{a_{j,p}^{U_q}=1} and the number of no-laugh samples N_p^{U_q,nolaugh}. Three criteria were used: the global rate, the laugh rate and the no-laugh rate. The global criterion rate^1_{avatar} (where avatar can be Expert, KNN, SVM, LargeMargin or Tree) is the rate of agreement, sample by sample, in terms of actions between one of the users and an avatar:

    rate^{1}_{avatar} = \frac{1}{2} \sum_{q=1}^{2} \frac{1}{P} \sum_{p=1}^{P} \frac{1}{N_p} \sum_{j=1}^{N_p} \mathbb{1}_{\{\pi_p^{avatar}(s_{j,p}) = a_{j,p}^{U_q}\}},    (10)

where π_p^{Expert}(s_{j,p}) = π_E(s_{j,p}) = a_{j,p}. The laugh criterion rate^2_{avatar} gives the rate of agreed laughs between the avatar and one of the users:

    rate^{2}_{avatar} = \frac{1}{2} \sum_{q=1}^{2} \frac{1}{P} \sum_{p=1}^{P} \frac{1}{N_p^{U_q,laugh}} \sum_{j=1}^{N_p} \mathbb{1}_{\{\pi_p^{avatar}(s_{j,p}) = a_{j,p}^{U_q}\}} \mathbb{1}_{\{a_{j,p}^{U_q} = 1\}}.    (11)

The criterion rate^3_{avatar} gives the rate of agreed no-laughs between the avatar and one of the users:

    rate^{3}_{avatar} = \frac{1}{2} \sum_{q=1}^{2} \frac{1}{P} \sum_{p=1}^{P} \frac{1}{N_p^{U_q,nolaugh}} \sum_{j=1}^{N_p} \mathbb{1}_{\{\pi_p^{avatar}(s_{j,p}) = a_{j,p}^{U_q}\}} \mathbb{1}_{\{a_{j,p}^{U_q} = 0\}}.    (12)

In order to have a single number representing the similarity between the expert avatar A_Expert and the avatars output by the algorithms, a new criterion, called Naturalness, is defined as follows:

    Naturalness_{algo} = \frac{1}{3} \sum_{i=1}^{3} \frac{\min(rate^{i}_{algo}, rate^{i}_{Expert})}{\max(rate^{i}_{algo}, rate^{i}_{Expert})}.    (13)

This criterion is thus a measure of the deviation between the behavior of the expert avatar and the behavior learnt by a given algorithm. If the Naturalness is equal to 1, it means that the avatar has the same behavior as the expert relative to the other users; if it is equal to zero, it means that the avatar has a completely different behavior from the expert.

Algorithms      Global rate    Laugh rate    No laugh rate
Expert          0.7079         0.4503        0.7649
Large Margin    0.739          0.486         0.787
SVM             0.783          0.536         0.7756
KNN             0.5096         0.85          0.4407
Tree            0.585          0.5858        0.563

Table 3: Rates used for Naturalness.

Algorithms      Naturalness
Expert          1
Large Margin    0.9
SVM             0.876
KNN             0.39
Tree            0.3874

Table 4: Naturalness.

Table 4 gives the results. The Large Margin method clearly outperforms the other ones, which means that its behavior relative to the other users corresponds closely to that of the expert. We see that KNN and the Tree have poor Naturalness, as they laugh too much relative to the other users, which is not what the expert does.
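A sketch of the Naturalness computation (Eqs. (10)-(13)), simplified to a single fold; `actions_*` are an avatar's laugh decisions over the frames and `a_users` the annotated laugh/no-laugh sequences of U1 and U2.

```python
import numpy as np

def agreement_rates(actions, a_users):
    """Eqs. (10)-(12): global, laugh and no-laugh rates of agreement
    between an avatar's actions and each user's annotations, averaged
    over the two users."""
    rates = np.zeros(3)
    for a_u in a_users:
        ok = actions == a_u
        rates += [ok.mean(), ok[a_u == 1].mean(), ok[a_u == 0].mean()]
    return rates / len(a_users)

def naturalness(actions_algo, actions_expert, a_users):
    """Eq. (13): 1 when the learned avatar relates to the other users
    exactly as the expert avatar does, 0 when completely differently."""
    r = agreement_rates(np.asarray(actions_algo), a_users)
    e = agreement_rates(np.asarray(actions_expert), a_users)
    return float(np.mean(np.minimum(r, e) / np.maximum(r, e)))
```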
6. Conclusion and perspectives

In this paper, a method for learning when an avatar should laugh during an interaction with humans was presented. It is based on a data-driven imitation learning algorithm, and especially on a structured classification method. The structured margin involved in this method is used to weight the importance of laughter compared to silence, so as to generate a more natural behaviour and to deal with the unbalanced nature of the data. It is shown, in a yes/no game setting, that the method outperforms other classification methods in terms of overall similarity with a human.

Compared to previous experimentations [2], this method objectively provides better results in terms of a newly introduced criterion. Here, imitation learning is reduced to a multi-class classification problem. Yet, imitation learning can also be addressed by other methods such as inverse reinforcement learning [18, 19]. Actually, this approach has been shown to work better for some types of problems [20] and has already been used to imitate human users in the case of spoken dialogue systems [21]. Therefore, we plan to extend this work to inverse reinforcement learning in the near future. Also, this method could be used to derive new simulation techniques for optimizing human-machine interaction managers in other applications such as spoken dialogue systems [22, 23].

7. Acknowledgements

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement number 270780.

8. References

[1] Marc Schröder, Elisabetta Bevacqua, Roddy Cowie, Florian Eyben, Hatice Gunes, Dirk Heylen, Mark ter Maat, Gary McKeown, Satish Pammi, Maja Pantic, Catherine Pelachaud, Björn Schuller, Etienne de Sevin, Michel Valstar, and Martin Wöllmer, "Building autonomous sensitive artificial listeners," IEEE Transactions on Affective Computing, vol. 3, no. 2, pp. 165-183, 2012.

[2] Radoslaw Niewiadomski, Jennifer Hofmann, Jérôme Urbain, Tracey Platt, Johannes Wagner, Bilal Piot, Hüseyin Cakmak, Sathish Pammi, Tobias Baur, Stéphane Dupont, Matthieu Geist, Florian Lingenfelser, Gary McKeown, Olivier Pietquin, and Willibald Ruch, "Laugh-aware virtual agent and its impact on user amusement," in Proceedings of the Twelfth International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2013), Saint Paul, USA, May 2013, pp. 619-626.

[3] Radoslaw Niewiadomski, Sathish Pammi, Abhishek Sharma, Jennifer Hofmann, Tracey Platt, Richard Thomas Cruz, and Bingqing Qu, "Visual laughter synthesis: Initial approaches," in Proceedings of the Interdisciplinary Workshop on Laughter and other Non-verbal Vocalisations, Dublin, Ireland, October 2012.

[4] Jérôme Urbain, Hüseyin Cakmak, and Thierry Dutoit, "Evaluation of HMM-based laughter synthesis," in Proceedings of the 38th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2013), Vancouver, Canada, May 2013, pp. 7835-7839.

[5] Dean Pomerleau, "ALVINN: An autonomous land vehicle in a neural network," in Advances in Neural Information Processing Systems (NIPS 1988), Vancouver, Canada, December 1988, pp. 305-313.

[6] Nathan Ratliff, J. Andrew Bagnell, and Siddhartha S. Srinivasa, "Imitation learning for locomotion and manipulation," in Proceedings of the 7th IEEE-RAS International Conference on Humanoid Robots, Pittsburgh, USA, November 2007, pp. 392-397.

[7] Stéphane Ross and J. Andrew Bagnell, "Efficient reductions for imitation learning," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010), Sardinia, Italy, May 2010, vol. 9 of JMLR Workshop and Conference Proceedings, pp. 661-668.

[8] Umar Syed and Robert E. Schapire, "A reduction from apprenticeship learning to classification," in Advances in Neural Information Processing Systems (NIPS 2010), Vancouver, Canada, December 2010, pp. 2253-2261.

[9] Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone, Classification and Regression Trees, Chapman & Hall, Monterey, CA, 1984.

[10] Thomas M. Cover and Peter E. Hart, "Nearest neighbor pattern classification," IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21-27, 1967.

[11] Vladimir Vapnik, Statistical Learning Theory, Wiley, 1998.

[12] Yann Guermeur, "A generic model of multi-class support vector machine," Journal of Intelligent Information and Database Systems, vol. 6, no. 6, pp. 555-577, October 2012.

[13] Ben Taskar, Vassil Chatalbashev, Daphne Koller, and Carlos Guestrin, "Learning structured prediction models: a large margin approach," in Proceedings of the 22nd International Conference on Machine Learning (ICML 2005), New York, NY, USA, 2005, pp. 896-903.

[14] Nathan Ratliff, David Bradley, J. Andrew (Drew) Bagnell, and Joel Chestnutt, "Boosting structured prediction for imitation learning," in Advances in Neural Information Processing Systems (NIPS 2007), Vancouver, Canada, December 2007.

[15] Alexander Grubb and J. Andrew Bagnell, "Generalized boosting algorithms for convex optimization," in Proceedings of the 28th International Conference on Machine Learning (ICML 2011), 2011.

[16] Naum Z. Shor, Krzysztof C. Kiwiel, and Andrzej Ruszczyński, Minimization Methods for Non-differentiable Functions, Springer-Verlag, 1985.
[17] Johannes Wagner, Florian Lingenfelser, Tobias Baur, Ionut Damian, Felix Kistler, and Elisabeth André, "The social signal interpretation (SSI) framework: multimodal signal processing and recognition in real-time," in Proceedings of the 21st ACM International Conference on Multimedia (MM 2013), New York, NY, USA, October 2013, pp. 831-834.

[18] Stuart Russell, "Learning agents for uncertain environments (extended abstract)," in Proceedings of the Eleventh Annual Conference on Computational Learning Theory (COLT 98), Madison, Wisconsin, USA, July 1998, pp. 101-103.

[19] Edouard Klein, Matthieu Geist, Bilal Piot, and Olivier Pietquin, "Inverse reinforcement learning through structured classification," in Advances in Neural Information Processing Systems (NIPS 2012), South Lake Tahoe, USA, December 2012.

[20] Bilal Piot, Matthieu Geist, and Olivier Pietquin, "Learning from demonstrations: Is it worth estimating a reward function?," in Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2013), Prague, Czech Republic, September 2013, vol. 8188 of Lecture Notes in Computer Science, pp. 17-32, Springer.

[21] Senthilkumar Chandramohan, Matthieu Geist, Fabrice Lefèvre, and Olivier Pietquin, "User simulation in dialogue systems using inverse reinforcement learning," in Proceedings of the 12th Annual Conference of the International Speech Communication Association (Interspeech 2011), Florence, Italy, August 2011, pp. 1025-1028.

[22] Olivier Pietquin and Thierry Dutoit, "A probabilistic framework for dialog simulation and optimal strategy learning," IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 2, pp. 589-599, March 2006.

[23] Olivier Pietquin and Helen Hastie, "A survey on metrics for the evaluation of user simulations," Knowledge Engineering Review, vol. 28, no. 1, pp. 59-73, February 2013.