Universität Bamberg
Angewandte Informatik
Seminar KI: gestern, heute, morgen

We Are Humor Beings: Understanding and Predicting Visual Humor

Daniel Tremmel
February 18, 2017
Advised by Professor Dr. Ute Schmid

Figure 1: An example scene from the dataset.

Although some research has been done on humor in artificial intelligence, only a few works deal with visual humor. The reason is that object recognition, which is essential for any study of visual humor, is not yet a solved problem. The technical paper "We Are Humor Beings: Understanding and Predicting Visual Humor" is one of the few research papers on visual humor in the field of artificial intelligence. This paper outlines the most important points of the study, including the research approach, the features that were used, and the results.

1 Introduction

Until now, there has not been much research on visual humor in artificial intelligence. The technical paper "We Are Humor Beings: Understanding and Predicting Visual Humor" is one of the few attempts to answer the question of whether visual humor can be modeled computationally. The obstacle so far has been that understanding visual humor requires recognizing all objects within a scene, distinguishing them from each other, and learning how they interact with each other; none of these challenges has been fully solved in artificial intelligence so far. Nevertheless, the topic could become relevant in many areas: smart cameras that capture exactly the right funny moment, recommendation tools that rate funny pictures higher than others, or video summarization tools that recognize and extract funny scenes from longer film sequences. These are just a few examples of how artificial intelligence could make an impact in the field of visual humor.

The researchers behind "We Are Humor Beings" therefore looked for a way to bypass the object recognition problem by using clipart scenes. The objects in the clipart scenes were densely annotated, so the computational model always knew which objects were in a scene and where they were located. This made it possible to conduct a study on visual humor using scene-level as well as object-level features.

The technical paper not only provided a way to model and evaluate visual humor computationally, it also contributed to how visual humor can be defined. To conduct the study, the researchers trained a support vector regressor on scene-level and instance-level features over two abstract datasets that were created by Amazon Mechanical Turk workers. As success criteria, the researchers formulated two tasks that served as guidelines for detecting humor.

In this paper I will first discuss the technological background of the project in order to provide the knowledge needed to understand its constraints. The main part of the paper discusses the research approach in detail as well as the outcomes of the project. The last chapters summarize the project and draw a conclusion.

2 Background

The researchers modeled two tasks, which they considered appropriate guidelines for measuring humor: rating the funniness of a scene and altering the funniness of a scene. The tasks served as success criteria for judging whether the experiment was successful. To carry out these tasks, they applied several features on the technical level: instance-level features as well as scene-level features. Instance-level features were needed mainly for altering the funniness, whereas the scene-level features were needed for measuring the funniness of a scene.

One important term with regard to the scene-level features is support vector regression (SVR). A support vector regressor was trained on the scene-level features, and an ablation study was performed subsequently. SVR has its origin in statistical learning theory, also called VC theory, which was developed to enable learning machines to generalize to unseen data. Initially, the SVR algorithm was a generalized portrait algorithm developed in the sixties in Russia by Vapnik, Chervonenkis, and others. SVR is widely used not only in artificial intelligence but also, for example, in high-risk financial prediction and in approximating complex engineering analyses. In artificial intelligence, SVR is mainly used to train machine learning models to predict the most probable outcomes of regularly performed tasks (Smola and Schölkopf, 2004). The idea is that recurring patterns are extracted from larger datasets so that a certain behavior can be modeled based on those patterns. SVR achieves this by minimizing the generalization error and thus reaching generalized behavior: a linear regression function is computed in a high-dimensional feature space into which the input data is mapped by a nonlinear function (Basak et al., 2004).

The scene-level features on which the support vector regressor was trained are cardinality, location, and scene embeddings. Location stands for the position of an object within the scene. The clipart scenes are subdivided by numerous vertical and horizontal lines, which form a grid of sectors. These lines are internal features that are not visible to the viewer of the scenes, but they serve as orientation for the computational model, so it recognizes where in the scene a certain object occurs. Cardinality refers to the number of instances of every object category within the scene, whereas the scene embedding, as used here, refers to the total number of all instances in a scene.
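To make these definitions concrete, the following sketch shows one plausible way to turn an annotated clipart scene into such a scene-level feature vector. The grid resolution, the category vocabulary, and the exact encoding are my assumptions for illustration; the paper summarized here does not specify them.

    import numpy as np

    CATEGORIES = ["person", "tree", "oak", "bench", "sun"]  # hypothetical vocabulary
    GRID_W, GRID_H = 7, 5  # assumed grid resolution

    def scene_features(objects):
        """Build a scene-level feature vector from (category, x, y) annotations,
        with x and y given as normalized coordinates in [0, 1]."""
        cardinality = np.zeros(len(CATEGORIES))   # instances per object category
        occupancy = np.zeros(GRID_W * GRID_H)     # which grid sectors are occupied
        for cat, x, y in objects:
            cardinality[CATEGORIES.index(cat)] += 1
            col = min(int(x * GRID_W), GRID_W - 1)
            row = min(int(y * GRID_H), GRID_H - 1)
            occupancy[row * GRID_W + col] = 1.0   # the object's location sector
        total = np.array([len(objects)])          # total instance count
        return np.concatenate([cardinality, occupancy, total])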

Figure 2 illustrates these features with a concrete scene.

Figure 2: A funny scene from the dataset.

As we can see in the picture, there are two oaks, three trees, two benches, and three middle-aged people in the scene, two of them male and one female; there is also the sun in the background. The cardinality of people is therefore three, of trees three, of oaks two, of benches two, and of suns one. As the scene embedding is the sum of all objects in the scene, its value is eleven. The whole scene is covered with an invisible grid through which the computational model gets an understanding of location. The model can, for example, see that the two persons on the right of the picture are in a sector that is further away with regard to the y-axis, whereas the person on the left of the picture is in a sector with a lower value on the y-axis. It can further recognize that the sun has a relatively high x-value and that it is relatively small compared to the people in the foreground, which indicates that the sun is in the background of the picture.

Given the cardinality, the computational model can conclude that a scene might be funny due to the presence of a certain object category and the number of its appearances. The scene embedding can further help recognize whether a scene is funny, as a scene with more objects is more likely to have interactions between objects, and a scene with more interactions is more likely to be funny: some form of interaction between objects is required for a scene to be funny, and if there is no interaction, no funny situation can arise. So location, cardinality, and scene embeddings make up the scene-level features, based on which the funniness of a scene is evaluated.
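Training the regressor itself on top of such feature vectors is then routine. A minimal scikit-learn sketch, with random toy data standing in for the real feature vectors and worker ratings:

    import numpy as np
    from sklearn.svm import SVR

    # X: one scene-level feature vector per scene (e.g. built as sketched above);
    # y: mean worker funniness ratings. Toy data stands in for both here.
    rng = np.random.default_rng(1)
    X = rng.random((50, 41))        # 50 scenes, 41-dimensional feature vectors
    y = 1 + 4 * rng.random(50)      # ratings on the 1-to-5 scale

    regressor = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)
    print(regressor.predict(X[:3]))  # predicted funniness scores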

Apart from the scene-level features, the researchers also used instance-level features, with which they altered the funniness of a scene. The instance-level features are object embeddings and local embeddings. An object embedding is a distributed representation of every object in the scene through which the context of each object category can be retrieved: essentially, every object in the scene has pointers to the other objects that typically appear around it. For this, a neural network is trained with a so-called bag-of-words model. The bag-of-words model is the representation through which the neural network learns the context of a certain object. It is a representation method for object categorization that is often used in natural language processing and computer vision. The idea behind it is that the occurrence of each word, or each object, is counted and represented in a histogram. In computer vision this helps to capture the context in which an object normally occurs: every object has a histogram of the objects that normally appear around it. This is how the object embedding feature works in "We Are Humor Beings".

The other instance-level feature used in the project is the local embedding. The local embedding is a representation of the distance of each object in the scene to the object under consideration. Figure 3 illustrates the instance-level features.

Figure 3: A funny scene from the dataset.

In the scene we see an old lady sitting on the couch watching TV. The scene is annotated with black arrows and blue lines. The black arrows point to the couch, the pillow, and the TV, objects that normally occur around an old lady; the bag-of-words model of the old lady therefore contains a couch, a TV, and a pillow. By training the neural network with such unfunny, everyday scenes, the model learns the normal context in which objects usually appear. The blue lines represent the local embedding features: they measure the distance between the objects in the scene and the lady. The smaller the distance between an object and the old lady, the more likely it is that there is an interaction between the two, compared to an object at a greater distance. The grid covering the scene also shows how location and distance are calculated. First, the location of an object is determined by the sector in which it lies; then the distance between two objects is determined by counting the number of sectors that lie between an object and the object under consideration.
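The following sketch illustrates both ideas on toy data: a bag-of-words style co-occurrence histogram as a stand-in for the object context, and sector counting as a stand-in for the local distance. Grid size, coordinates, and category names are invented for illustration.

    from collections import Counter

    GRID_W, GRID_H = 7, 5  # assumed grid resolution

    def sector(x, y):
        """Map normalized (x, y) coordinates to a (row, col) grid sector."""
        return (min(int(y * GRID_H), GRID_H - 1), min(int(x * GRID_W), GRID_W - 1))

    def sector_distance(a, b):
        """Distance as the number of sectors between two objects."""
        (ra, ca), (rb, cb) = sector(*a), sector(*b)
        return abs(ra - rb) + abs(ca - cb)

    def context_histogram(scenes, target):
        """Bag-of-words style count of categories co-occurring with `target`."""
        counts = Counter()
        for objects in scenes:
            categories = [c for c, _, _ in objects]
            if target in categories:
                counts.update(c for c in categories if c != target)
        return counts

    # Toy scene echoing Figure 3: an old lady with couch, pillow, and TV.
    scenes = [[("old lady", 0.2, 0.7), ("couch", 0.25, 0.75),
               ("pillow", 0.3, 0.7), ("tv", 0.85, 0.3)]]
    print(context_histogram(scenes, "old lady"))     # couch, pillow, tv
    print(sector_distance((0.2, 0.7), (0.85, 0.3)))  # lady-to-TV sector distance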

However, the scene in Figure 3 also shows a potential weakness of the model. As we can see, the lady is sitting on the couch and watching TV (the fact that the TV is not showing a picture can be neglected in this context), so there is an interaction between the lady and the TV. According to the local embedding features, however, an interaction is rather unlikely, as the distance between the TV and the lady is very large. In this situation, the local embedding feature could lead to a misperception that affects the model, as the neural network is trained with these kinds of scenes. This misperception could, however, be balanced out by the object embedding feature, since it represents the normal context of objects: if the neural network is trained with several scenes in which an old lady appears together with a couch, the model would still learn that there might be a connection between an old lady and a couch.

All the features described above were used in the course of the project. They helped to create a computational model that should be capable of recognizing and altering visual humor. The next chapter describes the research approach in detail, as well as how the scene-level and instance-level features were used.

3 Research Approach

In order to train a neural network, the researchers created two datasets, the Abstract Visual Humor (AVH) dataset and the Funny Object Replaced (FOR) dataset. The datasets were created by Amazon Mechanical Turk (AMT) workers. The scenes were built with a clipart interface consisting of 20 deformable human models, 31 animal models, and about 100 indoor and outdoor objects. The human models cover different genders, races, and ages, with different facial expressions. The Abstract Visual Humor dataset consists of about 3,200 funny scenes and about 3,200 unfunny scenes. From the funny scenes the workers also created the FOR dataset of about 15,000 scenes: for the FOR dataset, the workers were asked to change some objects of the funny scenes in order to make them unfunny. The next section explains the AVH dataset.

3.1 Abstract Visual Humor Dataset

For the AVH dataset, the AMT workers were told to create realistic funny scenes that could happen in an everyday context. With this condition the researchers wanted to prevent the workers from creating scenes that require insider knowledge. After creating a scene, the workers were asked to write a short description of the scene and of why they consider it funny. The intention was that the workers would then care more about the humor of the scene they created and be even more careful not to create scenes with insider humor. The workers were also told to create another, unfunny dataset of everyday scenes.

In the next step, the scenes had to be rated. As humor is a highly subjective phenomenon, the funniness of the scenes was rated by other workers; this helped to create objective measurements, since the worker who created a scene could not rate it himself. The guideline was that ten workers gave each scene a rating between one and five, where five was extremely funny and one was not funny at all. If the average rating of a scene was above the threshold, the scene went into the funny-scenes dataset; if it was below, it was put into the unfunny-scenes dataset.
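This splitting rule is a simple thresholding of mean ratings. A sketch of it follows; the threshold value is my assumption, as the exact cut-off is not stated in this summary.

    def split_by_rating(scene_ratings, threshold=2.5):
        """Assign scenes to the funny or unfunny dataset by mean worker rating.

        scene_ratings: dict mapping a scene id to its ten 1-to-5 ratings.
        threshold: assumed cut-off; the paper's exact value is not given here.
        """
        funny, unfunny = [], []
        for scene_id, ratings in scene_ratings.items():
            mean = sum(ratings) / len(ratings)
            (funny if mean > threshold else unfunny).append(scene_id)
        return funny, unfunny

    funny, unfunny = split_by_rating({"s1": [4, 5, 3, 4, 4, 5, 3, 4, 4, 5],
                                      "s2": [1, 2, 1, 1, 2, 1, 1, 2, 1, 1]})
    print(funny, unfunny)  # ['s1'] ['s2']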

In the end there were 522 unintentionally funny scenes and 682 unintentionally unfunny scenes. This shows that letting other workers rate the scenes actually has an impact on the datasets.

Another technique meant to give a better understanding was annotating the scenes with different humor techniques. The researchers created a list of humor techniques based on existing humor theories, derived partly from personal observation and partly from the humor typology of Buijzen and Valkenburg (Buijzen and Valkenburg, 2004). The typologies used were, for example, a person doing something unusual, an animal doing something unusual, clownish behavior, and so on. The workers were instructed to label the scenes with these typologies. The scenes below show the top-voted techniques applied during the process.

Figure 4: Top-voted scenes by humor technique. From left to right: animal doing something unusual, person doing something unusual, somebody getting hurt, and somebody getting scared.

An interesting insight was that all techniques involving animate objects were voted higher in terms of funniness than pictures involving only inanimate objects. In about 75% of all scenes the workers picked either "animal doing something unusual" or "person doing something unusual". In these cases we can speak of incongruity being applied to the scene. Incongruity occurs when objects appear in a context that is unusual for them; an example would be an old man playing football. Incongruity can therefore be exploited very well for altering the humor of a scene. For that, the FOR dataset was used.

3.2 Funny Object Replaced Dataset

The FOR dataset was created for the study of humor on the object level. For this dataset, the researchers asked the workers to alter the funny scenes of the AVH dataset by replacing as few objects as possible in order to make the scenes unfunny. The intention was that by changing as few things as possible, the researchers would gain a fine-grained understanding of which objects cause a scene to be funny and why. The workers were also told not to alter the underlying structure of the scene: they were not allowed to change the relations between the objects or the context of the scene, so the alteration took place exclusively on the object level. The FOR dataset consists of approximately 15,000 scenes, all of which were created from the AVH dataset: for each scene in the funny part of the AVH dataset, five unfunny counterparts were created.

3.3 Predicting the Funniness Score

In order to predict the funniness score of a scene, the researchers used a support vector regressor that was trained to regress to the funniness score F_i from the ratings given by the workers. Based on the ratings, the scene-level features are applied and an ablation study is conducted. To measure the success of the experiment, the researchers needed a metric; they measured success by the relative error:

\frac{1}{N} \sum_{i=1}^{N} \frac{|\mathrm{Predicted}\,F_i - \mathrm{GroundTruth}\,F_i|}{\mathrm{GroundTruth}\,F_i} \qquad (1)

In the formula, N stands for the number of test scenes and F_i for the predicted funniness score of a given test scene i. The ground truth is the funniness score given by the workers for that test scene. The experiment is measured against a baseline model that always predicts the average funniness score of the training scenes. The results of the experiment are discussed in the results section.

3.4 Altering the Funniness of a Scene

The model should be capable of altering the funniness of a scene in both directions: it should be able to make a funny scene unfunny and vice versa. The researchers considered this an appropriate guideline for measuring the model's understanding of humor. Altering the funniness of a scene imposes two requirements: in the first step, the model should propose which objects in the scene should be replaced; in the second step, a potential replacer object should be proposed.

3.4.1 Predicting objects to be replaced

For each object in a scene, the model has to make a binary prediction on whether the object should be replaced or not. For this task, a multi-layer perceptron was trained to make the prediction for each object. The predictions are evaluated both as a naive prediction, measuring the overall accuracy against the human annotations, and as a class-wise prediction. To count as successful in recognizing the objects to be replaced, the model needs to succeed in both the naive and the class-wise measurement.

The researchers used two baselines: priors and anomaly detection. The first prior is that an instance should never be replaced; a second prior replaces an object only if it is replaced in T% of the training data, where T was set to 20 based on the validation set. The anomaly detection works as follows: for the object under consideration, its object embedding is subtracted from the scene embedding, and the objects that are least similar to the remaining scene context are considered anomalous. All objects whose cosine similarity is below the threshold T are considered anomalous and are therefore replaced.
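A rough sketch of both ingredients on toy data follows. The feature dimensionality, network architecture, and similarity threshold are assumptions for illustration, not the paper's settings.

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    # Replace-or-keep prediction with a multi-layer perceptron (toy data).
    # X holds one instance-level feature vector per object; y is 1 if human
    # workers replaced that object in the FOR data, 0 otherwise.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 16))                 # assumed 16-d features
    y = (X[:, 0] + 0.3 * rng.normal(size=200) > 0.8).astype(int)
    mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
    mlp.fit(X, y)
    print("replace?", mlp.predict(X[:5]))

    # Anomaly-detection baseline: an object is flagged for replacement when
    # its embedding fits the remaining scene context poorly.
    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    def anomalous(object_embeddings, scene_embedding, threshold=0.2):
        flagged = []
        for name, emb in object_embeddings.items():
            context = scene_embedding - emb        # scene minus the object itself
            if cosine(emb, context) < threshold:   # threshold T is assumed
                flagged.append(name)
        return flagged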

3.4.2 Proposing a replacer object

After the model has decided which objects should be replaced, it should propose an appropriate replacer object. For this part of the task, the model is trained with the ground-truth annotations of the objects that were replaced by humans in the corresponding scene. This is necessary so that the model's performance in proposing the object to be replaced and its performance in proposing a replacer object can be measured separately from each other. For making a scene unfunny, the researchers used a so-called top-5 metric: if any of the five best predictions matches the ground truth, the proposal is considered correct. As baselines there are again priors and anomaly detection. The prior baseline replaces every object with one of the most frequent replacers in the training set. For the anomaly-detection baseline, the embedding of the object to be replaced is subtracted from the scene embedding, and the five objects that are most similar with respect to the object embedding feature are proposed as replacers.
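The top-5 metric mentioned above is simple to state in code. A minimal sketch, with function and variable names that are mine rather than the paper's:

    def top5_accuracy(proposals, ground_truth):
        """proposals: per scene, replacer categories ranked best-first.
        ground_truth: per scene, the category a human worker actually chose.
        A proposal counts as correct if the human choice is among the top five."""
        hits = sum(gt in ranked[:5] for ranked, gt in zip(proposals, ground_truth))
        return hits / len(ground_truth)

    # Toy usage with hypothetical categories:
    print(top5_accuracy([["ball", "butterfly", "dog", "cat", "kite", "eagle"]],
                        ["butterfly"]))  # 1.0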

3.5 Results

The aim of the study was to find out whether it is possible to create a computational model that is able to recognize and to alter humor. The concrete tasks the model was confronted with were rating the funniness of a scene and altering the funniness of a scene. This section presents the results of the study.

3.5.1 Results: Rating the funniness of a scene

Section 3.3 presented the metric with which the success of the model was measured: the relative error. The results of the measurement are shown in the table below.

Figure 5: Results for rating the funniness of a scene, reported as relative error.

As we can see, all scene-level features perform better than the average-baseline model. We can further see that the location feature shows the best performance of all features, with a relative error of 0.2400, which is considerably better than the baseline model against which the measurements were taken. All three features combined also have a relative error of 0.2400, which is due to redundancy: when all three features are combined, the result ends up at the measurement of the best single feature. What can clearly be said from these results is that the model achieved the first task it was tested on: according to the table, it was able to recognize humor and rate the funniness of a scene.

3.5.2 Results: Altering the funniness of a scene

When it came to altering the funniness of a scene, the challenge was that the model should be able to alter the funniness in both directions. At first sight it appears easier for the model to make a funny scene unfunny than vice versa; however, the model shows good results on both tasks. Below we can see scenes that the model altered in order to make them unfunny.

Figure 6: The scenes on the left are the original funny scenes; the scenes on the right are the altered unfunny scenes.

In both scenes some objects were changed in order to make the scene unfunny. In the upper scene the model exchanged an eagle stealing a steak for a butterfly and a ball, replaced the old man with a little boy, and additionally put the little boy into the background. Through this alteration, the context of the scene was changed completely: in the original scene the funniness came from the fact that two people are having a barbecue when suddenly an eagle appears and steals a steak from the grill, and by exchanging the eagle and the steak for a butterfly and a ball and the old man for a young boy, the context that created the funniness was removed. In the scene below it is basically the same, although it is not really clear why that scene is funny in the first place.

The quantitative results of making funny scenes unfunny confirm the impression the scenes above give. The model achieved a very good overall result, lowering the funniness score F_i to 1.64, clearly below the funniness score of the input scenes, which was 2.69. For altering scenes from unfunny to funny, the researchers used the original FOR dataset with the unfunny counterparts of the original funny scenes. The funniness score of the scenes made funny by the model was 2.14, which is lower than the funniness score of the original funny scenes but still a very decent score.

An interesting fact is also that in about 28% of the scenes, the model's funny scenes were considered funnier than the original funny scenes created by the workers.

4 Conclusion

The aim of the study was to create a model that is able to recognize and to alter humor. So far there has not been much research at the intersection of artificial intelligence and visual humor; one reason is that the problem of object recognition, which is vital for a study on visual humor, is not solved yet. The researchers of "We Are Humor Beings" bypassed the problem by using clipart scenes. To measure the success of their model, the researchers formulated the tasks of rating the funniness of a scene and altering the funniness of a scene. The results for both tasks show that the model was successful. Given that this research is quite new and that the total amount of research done in this field is small compared to other areas, the work can be considered a milestone on the road to creating a humorous computer model. Its major achievement is that it managed to teach a computational model to recognize context and interactions within a scene, which is absolutely necessary for an understanding of visual humor.

However, this research can only be one of the first steps toward understanding visual humor and creating a computational model for it. The crux of any research on humor and artificial intelligence will be that humor itself is so complex that there is no single definition of what humor is. Keeping that in mind, further research should be done in this field: although this work taught a computational model some understanding of context and interaction, further research in this direction is vital for a better understanding of visual humor in general and of its treatment in artificial intelligence in particular.

References

Debasish Basak, Srimanta Pal, and Dipak Chandra Patranabis. 2004. Support vector regression. Neural Information Processing - Letters and Reviews 11, 10 (2004), 199-222.

M. Buijzen and P. M. Valkenburg. 2004. Developing a typology of humor. Media Psychology 2, 4 (2004).

Alex J. Smola and Bernhard Schölkopf. 2004. A tutorial on support vector regression. Statistics and Computing 14 (2004), 199-222.