
Sememe Prediction: Learning Semantic Knowledge from Unstructured Textual Wiki Descriptions

Wei Li, Xuancheng Ren, Damai Dai, Yunfang Wu, Houfeng Wang, Xu Sun
MOE Key Laboratory of Computational Linguistics, School of Electronics Engineering and Computer Science, Peking University
{liweitj47, renxc, daidamai, wuyf, wanghf, xusun}@pku.edu.cn

arXiv:1808.05437v1 [cs.CL] 16 Aug 2018

Abstract

Huge numbers of new words emerge every day, leading to a great need for representing them with semantic meaning that is understandable to NLP systems. Sememes are defined as the minimum semantic units of human languages, the combination of which can represent the meaning of a word. Manual construction of sememe-based knowledge bases is time-consuming and labor-intensive. Fortunately, communities are devoted to composing descriptions of words on wiki websites. In this paper, we explore automatically predicting lexical sememes based on the descriptions of words on wiki websites. We view this problem as a weakly ordered multi-label task and propose a Label Distributed seq2seq model (LD-seq2seq) with a novel soft loss function to solve it. In the experiments, we take a real-world sememe knowledge base, HowNet, and the corresponding descriptions of the words in Baidu Wiki [1] for training and evaluation. The results show that our LD-seq2seq model not only beats all the baselines significantly on the test set, but also outperforms amateur human annotators on a random subset of the test set.

1 Introduction

With the development of the Internet, new words are emerging at an unprecedented speed. It is difficult for natural language processing (NLP) systems to understand these new words or phrases from limited contexts without auxiliary information. Fortunately, many volunteers in the community are devoted to constructing wiki pages for many of the new words and phrases, which makes wiki websites like Wikipedia [2] and Baidu Wiki [1] very valuable resources.
However, the descriptions in the wiki pages are written in natural language, which is unstructured, noisy and hard for NLP systems to understand. Therefore, there is a great need to represent these words with semantic meanings in a structured fashion that can be easily understood by NLP systems.

[1] https://baike.baidu.com/
[2] https://en.wikipedia.org/

| Word | Description in Baidu Wiki | Sememes |
| 缕析 (analysis in detail) | 逐条认真的分析缕析行情 (A careful analysis. Analyze the market in detail) | 分析 (analyze), 详 (detailed) |

Table 1: Example of Sememe Prediction via Wiki Description

Words can be represented with semantic sub-units from a finite set of limited size. For example, the word lovers can be approximately represented as {Human, Friend, Love, Desire}, and the word 缕析 (analysis in detail) can be represented as analyze and detailed (see Table 1). Linguists define sememes as this kind of semantic sub-unit of human languages (Bloomfield, 1926) that expresses the semantic meanings of concepts. This idea is similar to the idea of language universals (Goddard and Wierzbicka, 1994).

To represent the semantic meaning of words with sememes, researchers build sememe-based knowledge bases (KBs) by annotating words with a pre-defined set of sememes. One of the most well-known usable sememe KBs is HowNet (Dong and Dong, 2006). In the ontology of HowNet, there are over 2,000 sememes. Its authors manually annotated more than 100,000 Chinese words and phrases in a hierarchical structure. Because of its explicit way of representing knowledge (the sememes, limited in number, embody the knowledge), HowNet is easy to adopt in NLP systems while remaining understandable to human beings.

The manual construction of such KBs is very time-consuming and labor-intensive. For instance, HowNet took a number of linguistic experts more than 10 years to build. However, many of

the annotated words in the KBs are already out of date; meanwhile, manual construction cannot keep up with the speed at which new words emerge.

In the real world, there are many different wiki websites, such as Wikipedia [3], Baidu Wiki [4], Hudong Wiki [5], and so on. These websites contain millions of high-quality articles describing the world knowledge embodied in words and phrases. For instance, Baidu Wiki contains 15,243,192 articles, mostly in Chinese. When people are not familiar with some words, nowadays they prefer to look up the descriptions on these wiki websites. However, for commonly used classical words, dictionaries are still a valuable source in which people look up the meanings of words. Therefore, we think it is reasonable to use resources from both kinds of web pages.

In this paper, we explore a way to predict the lexical sememes of a word based on its corresponding descriptions in the wiki (dictionary) pages. We view this task as a weakly ordered multi-label problem (the order is already given by HowNet). Vinyals et al. (2015) claimed that the order between labels matters, and they proposed to use seq2seq learning for the multi-label problem. Nam et al. (2017) proposed several ways to organize the order of labels so that seq2seq would work better on the multi-label classification (MLC) task. We observe that the classical sequence-to-sequence (seq2seq) model makes a strong assumption on the order of the labels, which is not suitable for the multi-label problem. Assuming an order between tokens with heuristic rules is also problematic. Therefore, we propose a novel label distributed seq2seq model (LD-seq2seq) with a soft loss function to solve the problem. Since a single wiki description may be noisy and incomplete, we design a multi-resource encoder that can take various description resources (e.g., descriptions from different wiki websites) into consideration.
Our contributions lie in the following aspects:

- We propose to predict the sememes of a word based on its textual descriptions in wiki pages, which transforms the unstructured textual knowledge from wiki pages into distributed semantic knowledge.
- We view this task as a weakly ordered multi-label problem and propose a Label Distributed Seq2seq model with a soft loss function to solve the problem.
- We do extensive experiments on sememe prediction and observe that our model beats all the baselines. Our model even outperforms amateur human annotators on a random subset of the test set. Furthermore, we give a detailed analysis of the causes of errors, with concrete examples and possible solutions.

[3] http://www.wikipedia.org
[4] http://baike.baidu.com
[5] http://www.baike.com

2 Related Work

HowNet has been widely used in various NLP tasks such as word similarity computation (Liu and Li, 2002), word sense disambiguation (Duan et al., 2007) (similar to word clustering (Jin et al., 2007)), sentiment analysis (Huang et al., 2014) and named entity recognition (Li et al., 2016). Niu et al. (2017) claimed that using word sememe information in HowNet can improve word representation. Zeng et al. (2018) proposed to expand the Linguistic Inquiry and Word Count (Pennebaker et al., 2001) lexicons based on word sememes.

Xie et al. (2017) proposed to predict the sememes of a word by measuring the similarity between jointly learned word embeddings and sememe embeddings. Their solution is simple and straightforward. However, in many real-world applications, we do not have access to accurately learned word embeddings, especially for new words. First, it is hard to collect enough context data for learning the embeddings of new words. Second, in most deep learning applications, the word embeddings are fixed after training, which makes it difficult to learn the embeddings of new words and fit them into the system.
There are three main types of traditional machine learning algorithms for the Multi-Label Classification (MLC) task: problem transformation methods (Boutell et al., 2004; Tsoumakas and Vlahavas, 2007; Read et al., 2011), algorithm adaptation methods (Clare and King, 2001; Zhang and Zhou, 2007; Fürnkranz et al., 2008) and ensemble methods (Tsoumakas et al., 2011; Szymański et al., 2016). Simple neural network models have also been applied to MLC tasks (Zhang and Zhou, 2006; Nam et al., 2014;

Benites and Sapozhnikova, 2015; Kurata et al., 2016). Li et al. (2015) proposed to consider the previously generated labels as features for predicting new ones. Yang et al. (2018) further developed this idea and used recurrent neural networks to model the correlation between labels.

3 Our Approach

In this section, we present our solution to the sememe prediction task. An overview of our model is shown in Figure 1.

3.1 Task Definition

Given one (single resource) or a few (multiple resources) textual descriptions $D = (d^{(1)}, d^{(2)}, \ldots, d^{(m)})$ of a word from the wiki pages, our goal is to predict the corresponding sememes $s = (s_1, s_2, \ldots, s_n)$ of the word, where the elements of $s$ are drawn from the sememe label space $S$. Our task can be modeled as finding an optimal label sequence $s$ that maximizes the conditional probability $p(s \mid D)$, which is calculated as follows,

$$p(s \mid D) = \prod_{i=1}^{n} p(s_i \mid s_1, s_2, \ldots, s_{i-1}, D) \tag{1}$$

3.2 Basic Seq2seq Model for Multi-Label

Vinyals et al. (2015) proposed to use the seq2seq paradigm to deal with the problem of predicting labels that form a set. They claimed that the order of the labels matters even for labels that form a set.

Encoder: A textual description $d^{(i)}$ in $D$ with $l$ words is first encoded into $l$ hidden states $(h_1, h_2, \ldots, h_l)$ by a bidirectional gated recurrent neural network (BiGRNN), the last of which is treated as the vector $v$ for the textual description,

$$h_t = \mathrm{GRU}(h_{t-1}, x_t) \tag{2}$$

Decoder: The decoder generates the sememes one by one based on the vector $v$. At the $t$-th decoding step, the probability distribution $p_t$ over sememes is calculated as follows,

$$s_t = \mathrm{GRU}([s_{t-1}; c_t; e_{t-1}]) \tag{3}$$

$$p_t = \mathrm{softmax}(W s_t + b) \tag{4}$$

$$c_t = \sum_{i=1}^{l} \alpha_{t,i} h_i \tag{5}$$

$$\mathrm{score}_{t,i} = v_a^{T} \tanh(W_a s_t + U_a h_i) \tag{6}$$

$$\alpha_{t,i} = \frac{\exp(\mathrm{score}_{t,i})}{\sum_{j=1}^{l} \exp(\mathrm{score}_{t,j})} \tag{7}$$

where $s_t$ is the hidden state at the $t$-th step, $c_t$ is the context vector calculated with the attention mechanism over the hidden states of the description $(h_1, h_2, \ldots, h_l)$, and $e_{t-1}$ is the embedding of the sememe with the highest probability predicted at the $(t-1)$-th step.
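The attention step of Equations (5) to (7) can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the implementation used in the experiments: the helper names (`matvec`, `softmax`, `attention`) and all toy parameter values are hypothetical, and the decoder state `s_t` is simply passed in as an argument.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of scores (Eq. 7)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(s_t, H, W_a, U_a, v_a):
    """Return attention weights alpha_t and context vector c_t (Eqs. 5-7)."""
    Ws = matvec(W_a, s_t)
    scores = []
    for h in H:
        Uh = matvec(U_a, h)
        hidden = [math.tanh(a + b) for a, b in zip(Ws, Uh)]
        scores.append(sum(v * x for v, x in zip(v_a, hidden)))  # Eq. 6
    alpha = softmax(scores)                                      # Eq. 7
    dim = len(H[0])
    c_t = [sum(a * h[k] for a, h in zip(alpha, H)) for k in range(dim)]  # Eq. 5
    return alpha, c_t

# Toy values (hypothetical): three encoder states of dimension 2.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_t = [0.5, -0.5]
W_a = [[1.0, 0.0], [0.0, 1.0]]
U_a = [[1.0, 0.0], [0.0, 1.0]]
v_a = [1.0, 1.0]
alpha, c_t = attention(s_t, H, W_a, U_a, v_a)
print(alpha, c_t)
```

The weights `alpha` always sum to one, so `c_t` is a convex combination of the encoder hidden states.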
3.3 Proposed Label Distributed Seq2seq Model

We think that even though the order of the labels matters, we should not strictly restrict it. However, the traditional cross entropy loss function applied to the classical seq2seq model puts a strict assumption on the order of the labels. For example, if the third token in the target sequence is predicted at the first position, it is punished no differently from predicting an utterly wrong token. To deal with the task of predicting weakly ordered labels (or even unordered labels), we propose a soft loss function instead of the original hard cross entropy loss function,

$$\mathrm{loss} = -\sum_{i} y_i \log(p_i) \tag{8}$$

Instead of using the original hard one-hot target probability $y_i$, we use a soft target probability distribution, which is calculated according to $y_i$ and the sememe sequence $s$ of this sample. Let $q_s$ denote the bag-of-words representation of $s$, where only the slots of the sememes in $s$ are filled with 1s. We use a function $\xi$ to project the original target label probability $y$ into a new probability distribution $y'$,

$$y'_t = \xi(y_t, q_s) \tag{9}$$

This function is designed to soften the harsh punishment when the model predicts the labels in the wrong order. In this paper, we apply a simple yet effective projection function, given in Equation (10). It should be noted that this is an example implementation; one can also design more sophisticated projection functions if needed,

$$\xi(y_t, q_s) = \frac{(q_s / M) + y_t}{2} \tag{10}$$

where $M$ is the length of $s$. This function means that at the $t$-th decoding step, we first split a probability mass of 1.0 equally across all the $M$ target tokens, giving each token $s_i$ a probability of $1/M$. Then, we take the average of this probability distribution and the original probability $y_t$ as the final target distribution at step $t$.
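To make Equations (8) to (10) concrete, the following minimal sketch builds the soft target for one decoding step and compares the hard and soft losses. It is an illustration under assumptions rather than the authors' code: the function names, the five-label space and the two predicted distributions are hypothetical, the sememes in `s` are assumed to be distinct, and the `<EOS>` token is left out of `s`.

```python
import math

def soft_target(y_t, sememe_ids, num_labels):
    """Eq. 10: average the one-hot target y_t with q_s / M."""
    M = len(sememe_ids)
    q_s = [0.0] * num_labels
    for idx in sememe_ids:
        q_s[idx] = 1.0  # bag-of-words slots of the target sememes
    return [(q_s[i] / M + y_t[i]) / 2.0 for i in range(num_labels)]

def cross_entropy(target, probs):
    """Eq. 8: negative sum of target * log(predicted probability)."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0.0)

# Hypothetical label space of 5 sememes; target sequence s = (2, 0, 3).
num_labels = 5
sememe_ids = [2, 0, 3]
y_1 = [0.0, 0.0, 1.0, 0.0, 0.0]           # one-hot target at step 1
y_soft = soft_target(y_1, sememe_ids, num_labels)
print(y_soft)                              # roughly [1/6, 0, 2/3, 1/6, 0]

# Two predictions with the same probability on the gold first label:
out_of_order = [0.70, 0.05, 0.15, 0.05, 0.05]   # favors sememe 0 (in s)
utterly_wrong = [0.05, 0.05, 0.15, 0.05, 0.70]  # favors sememe 4 (not in s)

hard_a = cross_entropy(y_1, out_of_order)
hard_b = cross_entropy(y_1, utterly_wrong)
soft_a = cross_entropy(y_soft, out_of_order)
soft_b = cross_entropy(y_soft, utterly_wrong)
print(hard_a, hard_b, soft_a, soft_b)
```

With the one-hot target both predictions receive the same loss, while the soft target penalizes the out-of-order prediction less than the utterly wrong one. The gold label at the current step still keeps the largest share of the target mass (2/3 here, against 1/6 for each of the other two sememes).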

Figure 1: An overview of the proposed label distributed seq2seq model. We compute the loss based on a soft probability distribution rather than the one-hot distribution.

Figure 2: An illustration of the multi-resource model. The different description vectors and context vectors are combined with the gate mechanism. The figure shows two descriptions, while our model can be extended to multiple descriptions.

3.4 Multi-Resource Model

A description from a single source can be unreliable and may not express the meaning of the word comprehensively. In this paper, we propose to use a multi-resource encoder to make use of descriptions from multiple resources. An overview of this model is shown in Figure 2. To demonstrate the effectiveness of multiple resources, we implement our encoder using two resources for simplicity, but it can be extended to more resources without much effort.

Assume that for a word we have two textual descriptions $d^{(1)}$ and $d^{(2)}$, containing $l^{(1)}$ and $l^{(2)}$ words, $(w^{(1)}_1, w^{(1)}_2, \ldots, w^{(1)}_{l^{(1)}})$ and $(w^{(2)}_1, w^{(2)}_2, \ldots, w^{(2)}_{l^{(2)}})$ respectively. We use BiGRNN to encode the two descriptions separately into two sequences of hidden states, $(h^{(1)}_1, h^{(1)}_2, \ldots, h^{(1)}_{l^{(1)}})$ and $(h^{(2)}_1, h^{(2)}_2, \ldots, h^{(2)}_{l^{(2)}})$. We use the hidden states at the last time step, $h^{(1)}_{l^{(1)}}$ and $h^{(2)}_{l^{(2)}}$, as the representations of the corresponding descriptions $d^{(1)}$ and $d^{(2)}$, which we denote as $v^{(1)}$ and $v^{(2)}$. To combine the two vectors $v^{(1)}$ and $v^{(2)}$ into one uniform vector $v$, we apply the gate mechanism, which is calculated as follows,

$$g_1 = \sigma(W_1 [v^{(1)}; v^{(2)}] + b_1) \tag{11}$$

$$g_2 = \sigma(W_2 [v^{(1)}; v^{(2)}] + b_2) \tag{12}$$

$$v = g_1 \odot v^{(1)} + g_2 \odot v^{(2)} \tag{13}$$

where $\sigma$ indicates the sigmoid function, $W_1$, $W_2$, $b_1$ and $b_2$ are learnable parameters, and $\odot$ means element-wise multiplication.
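The gate mechanism of Equations (11) to (13) can be sketched as follows. This is a toy illustration, not the trained model: the function names and parameter values are hypothetical, and the gates are assumed to be vectors of the same dimension as $v^{(1)}$ and $v^{(2)}$, so that each weight matrix has shape d x 2d.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate(W, b, concat):
    """One gate vector: sigmoid(W [v1; v2] + b), as in Eqs. 11 and 12."""
    return [sigmoid(sum(w * x for w, x in zip(row, concat)) + b_k)
            for row, b_k in zip(W, b)]

def combine(v1, v2, W1, b1, W2, b2):
    """Eq. 13: element-wise gated sum of the two description vectors."""
    concat = v1 + v2                      # list concatenation = [v1; v2]
    g1 = gate(W1, b1, concat)
    g2 = gate(W2, b2, concat)
    return [a * x + c * y for a, x, c, y in zip(g1, v1, g2, v2)]

# Hypothetical description vectors of dimension d = 2.
v1 = [1.0, 2.0]
v2 = [3.0, -2.0]
zeros_W = [[0.0] * 4 for _ in range(2)]   # shape d x 2d

# With all-zero parameters both gates equal 0.5 everywhere.
v_even = combine(v1, v2, zeros_W, [0.0, 0.0], zeros_W, [0.0, 0.0])
print(v_even)

# A large bias opens gate 1 and closes gate 2, so v is close to v1.
v_first = combine(v1, v2, zeros_W, [20.0, 20.0], zeros_W, [-20.0, -20.0])
print(v_first)
```

Because the two gates are computed independently, they need not sum to one. The model can therefore attenuate both descriptions when both are noisy, which a softmax-style mixture could not do.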
The decoder part follows the same structure as the model in Section 3.3, except that we first separately calculate the context vectors $c^{(1)}_t$ and $c^{(2)}_t$ with the attention mechanism. Then we use the gate mechanism to combine the two vectors $c^{(1)}_t$ and $c^{(2)}_t$ into one context vector $c_t$. The gate mechanism here follows the same process as the combination of $v^{(1)}$ and $v^{(2)}$, with different parameters.

4 Experiment

4.1 Dataset

HowNet: HowNet is a knowledge base that uses sememes to represent the semantic meaning of a word or a phrase. There are over 100,000 annotated words in HowNet. Words can have multiple senses. Each sense is further represented by a combination of no more than 8 sememes. The sememes form a hierarchical structure. However, following the settings of most previous work, we do not consider the specific relations between sememes, but only the order between them, which we call weakly ordered sememes. For simplicity, we do not consider multiple senses, and just assume that the first sense of the word is its basic sense.

Wiki Pages: Because the words annotated in HowNet consist of both common words and newly emerged words (at the time of annotation), we choose two description sources for the words, Baidu Wiki (百度百科, http://baike.baidu.com) and Baidu Dictionary (百度词典, http://dict.baidu.com). Baidu Wiki contains 15,244,702 articles that are edited by volunteers and cover a lot of newly emerged words, while Baidu Dictionary is similar to language dictionaries (still crowd-sourced) with

better quality definitions and descriptions for common words. We collect the textual descriptions of the words annotated in HowNet from Baidu Wiki and Baidu Dictionary, and obtain 62,810 words that have attached descriptions (a word is counted if a description exists in at least one of the two sources). We randomly split the data into three parts: Train (80%), Dev (10%) and Test (10%).

4.2 Baseline Models

- ML-KNN (Multi-label KNN): This is the k-nearest-neighbor classification method adapted to multi-label classification.
- LP (Label Powerset): LP (Tsoumakas and Vlahavas, 2007) is a problem transformation approach that transforms a multi-label problem into a multi-class problem, with one multi-class classifier trained on all unique label combinations found in the training data.
- CC (Classifier Chain): For a label space with L labels, CC (Read et al., 2011) trains L classifiers ordered in a chain according to the Bayesian chain rule.
- BR (Binary Relevance): BR (Boutell et al., 2004) transforms a multi-label classification problem with L labels in the label space into L separate single-label binary classification problems using the same base classifier.
- RNN-MLLR (RNN multi-label logistic regression): This model uses the same multi-resource encoder as our proposed model, but uses a one-versus-all logistic regression multi-label classifier to predict the sememes based on the encoded vector of the descriptions.

4.3 Experiment Details

For the textual descriptions, we use characters as the input; the character vocabulary size is 11,097. We randomly initialize the character embeddings. There are 2,185 sememes in HowNet. We use the word2vec (Mikolov et al., 2013) toolkit, with the default parameters of the code, to pre-train the sememe embeddings so that they capture the co-occurrence relationship of the sememes. The sememe embeddings are fine-tuned during training. The dimension of both the character embeddings and the sememe embeddings is 200.
All the dimensions of hidden states are set to 300. The batch size is 20. An <EOS> token is added to the end of each sememe sequence to indicate when to stop prediction. We use the Adam optimizer (Kingma and Ba, 2014) to minimize the loss. We train our model for 10 epochs, and choose the model parameters from the epoch that gets the highest F1 score on the Dev set.

| Model | P | R | F1 |
| ML-KNN | 29.34 | 9.26 | 14.08 |
| LP | 26.06 | 23.92 | 24.94 |
| BR | 32.30 | 21.59 | 25.88 |
| CC | 33.33 | 21.37 | 26.04 |
| RNN-MLLR | 44.26 | 33.54 | 38.16 |
| Basic Seq2seq | 43.86 | 40.92 | 42.34 |
| LD-Seq2seq (Proposal) | 47.96 | 41.99 | 44.78 |

Table 2: Comparison with different baseline models. All the models use two wiki resources in this table. P means precision, R means recall.

4.4 Results and Analysis

We use micro precision (P), recall (R) and F1 score as the evaluation metrics.

Comparison with Baselines: In Table 2, we show our experiment results compared with the baseline methods. From the results we can see that the nearest-neighbor-based method ML-KNN performs the worst for sememe prediction. We assume that this is because the textual descriptions are very diverse, which makes it hard for KNN to determine the borders between the regions of different labels. The problem transformation methods (LP, BR, CC) perform close to each other, with F1 scores around 25%.

Compared with the traditional machine learning methods (ML-KNN, LP, CC, BR), the neural network based methods (RNN-MLLR, Basic Seq2seq) perform much better, beating the other baselines by a big margin. Although RNN-MLLR achieves good results, it is still not as good as the seq2seq based models. We assume that this is because MLLR based models are not very good at modeling the connections between labels. In our sememe prediction task, the sememes are in weak order. Moreover, some sememes are strongly related to others, and some sememes often co-occur. For instance, when the sememe Emotion occurs, it is likely to be followed by FeelingByBad, generic and desire.
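The micro-averaged metrics used in this section can be computed by pooling the counts over all test words before dividing. The sketch below is a minimal illustration under the assumption that predictions and references are compared as sets (ignoring order and the `<EOS>` token); the function name `micro_prf` is hypothetical, and the two toy samples reuse the 宦门 and 混纺 examples from Table 5.

```python
def micro_prf(references, predictions):
    """Micro precision, recall and F1 over lists of sememe collections."""
    tp = pred_total = ref_total = 0
    for ref, pred in zip(references, predictions):
        ref_set, pred_set = set(ref), set(pred)
        tp += len(ref_set & pred_set)      # correctly predicted sememes
        pred_total += len(pred_set)
        ref_total += len(ref_set)
    p = tp / pred_total if pred_total else 0.0
    r = tp / ref_total if ref_total else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f1

references = [
    ["family", "human", "official"],       # 宦门 (official family)
    ["artifact", "clothing", "tool"],      # 混纺 (blend fabric)
]
predictions = [
    ["family", "official"],
    ["material", "clothing", "tool"],
]
p, r, f1 = micro_prf(references, predictions)
print(p, r, f1)
```

Here 4 of the 5 predicted sememes are correct and 4 of the 6 reference sememes are recovered, so precision is 4/5, recall is 4/6, and F1 is 8/11.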
Our proposed Label Distributed seq2seq model achieves the best performance. We assume that this is because, even though the order between labels matters (Vinyals et al., 2015), for

the weakly ordered multi-label problem, a strong assumption on ordering hurts the performance, and our soft loss function can effectively relieve the problem.

| Method | Precision | Recall | F1 |
| Human | 21.89 | 57.36 | 31.69 |
| Human+Wiki | 23.62 | 62.79 | 34.32 |
| Proposal | 53.92 | 42.64 | 47.62 |

Table 3: Comparison with human performance on a random subset of test samples. Human means that the annotator does not have access to the wiki descriptions. Human+Wiki means the annotator has access to the wiki descriptions.

Comparison with Human Performance: In Table 3, we show the results of amateur human annotators and of our model on a subset of the test set. We randomly select 100 samples from the test set, and ask human annotators to select 1 to 5 sememes out of 20 that they think can describe the meaning of the word. Because the annotators do not have background knowledge of HowNet, we make the annotation task simpler than annotating from scratch: we guarantee that all the correct sememes are within the selected 20 sememes. The annotators are highly educated (with proper knowledge) amateur native speakers without special training in linguistics or in the annotation system of HowNet. The annotators are asked to first predict the sememes based on their common sense (Human); then they are provided with the descriptions from Baidu Wiki and asked to do the work again (Human+Wiki).

From the results we can see that even for human beings, it is hard to predict the sememes completely right without special training in the annotation system of HowNet. Human annotators are able to understand the semantic meaning of the word and can understand the description very well. However, they tend to predict more sememes than there actually are, which is reflected in the high recall and low precision. The imbalance between precision and recall indicates that the sememe architecture of HowNet may be too fine-grained: many sememes other than the actual ones are also related to the word in meaning.
Still, by referring to wiki descriptions, human annotators are able to predict more precisely. This is because the dataset contains some rare words or entities that people seldom use in real life.

Although the recall of our proposed model is not as high as that of human annotators, its precision beats human annotators by a big margin, which makes its F1 score higher than the human score. We assume that this is because, by learning from a large amount of training data, our model is more likely to be consistent with the logic of the annotation system.

Effect of Proposed Soft Loss Function: From Table 2 we can see that the seq2seq model with our novel soft loss (LD-Seq2seq) performs much better than the basic seq2seq model. We think that this is because our novel loss function eases the restriction on the order between labels. For example, assume the target sememes are $(s_1, s_2, s_3)$ in order. At the first decoding step, the one-hot loss function would strongly punish the decoder for giving $s_2$ or $s_3$ any probability, which may confuse the decoder, because the difference between time step 1 and time step 2 may not be significant when the order of the labels is not obvious. In contrast, our soft loss function still leads the decoder to choose $s_1$ first, while the two labels $s_2$ and $s_3$ are also encouraged with some probability less than that of $s_1$. The experiment results show that this modification is very effective in making seq2seq work well on the multi-label problem.

Effect of Applying Multi-Resource: From Table 4 we observe that using multiple resources instead of a single one can greatly improve the performance. This matches our expectation, as more descriptions can provide more comprehensive information about the word from various aspects. Moreover, since the alignment between sememes and descriptions is noisy, the gate mechanism can automatically decide how much each description contributes to the prediction based on its relatedness.
Between the two resources we use (Baidu Wiki and Baidu Dictionary), the dictionary-style resource provides much higher precision (47.15 vs. 42.91). We assume this is because the descriptions in this kind of resource have better quality in general. However, many new words and rare words are not included in the dictionary, and some of the entries in Baidu Dictionary have noisy descriptions as well (e.g., English descriptions instead of Chinese), so the dictionary alone does not predict as well as the multi-resource model.

Figure 3: The distribution of prediction result types (Correct, Partial, Plausible, Wrong; Wrong accounts for 29%).

Figure 4: The distribution of error types in Wrong (Literal 24.14%, Close 20.69%, Polysemy 17.24%, Unable 17.24%, Complex 10.34%, Pattern 6.90%, Too Simple 3.45%).

| Model | Precision | Recall | F1 |
| SingleRes-Wiki | 42.91 | 27.75 | 33.70 |
| SingleRes-Dict | 47.15 | 29.83 | 36.54 |
| MultiRes | 47.96 | 41.99 | 44.78 |

Table 4: Results of using different resources. The seq2seq model applies the basic architecture without adaptation to the multi-label problem. SingleRes indicates that the encoder only considers a single textual resource. MultiRes indicates that the encoder considers multiple textual resources (Wiki and Dictionary).

4.5 Error Analysis and Case Study

In Figure 3, we show the distribution of the results from a randomly chosen subset of test samples (100 samples), and give some concrete examples of sememe prediction in Table 5. We use accuracy (a case is viewed as right only if all of its sememes are matched) as the evaluation metric in the error analysis.

Correct means the prediction is completely right.

Wrong means that our model makes wrong predictions. For instance, for the word 国有化 (nationalize), the standard answer is -ize and central, while our prediction is place, own, country and politics. None of the predicted sememes are in the answer set, but these sememes actually make sense, because to nationalize is indeed to make something owned by the country, which is usually a political action. Our prediction fails to capture the dynamic procedure of -ize, but this sequence of sememes can still describe some aspects of the word, and is thus able to help in downstream tasks.
Partial means that part of the result is correct or the result is a subset of the real answer. For instance, for the word 宦门 (official family), our prediction is family and official, while the correct answer is family, human and official. Our prediction captures most of the meaning, and the missing sememe human can actually be deduced from the sememe family.

Plausible means that we think the predicted sememes reflect the meaning of the word equally well or better, even though they differ from the original ones. For example, for the word 混纺 (blend fabric), our prediction is material, clothing and tool, while the answer is artifact, clothing and tool. The difference between the two sequences of sememes lies between material and artifact. Blend fabric is clearly an artificial material, so both the answer and our prediction capture one aspect of the word, and our sequence of sememes is arguably even better at presenting the semantic meaning of the word.

The existence of plausible predictions (not entirely equal to the reference) may be related to the annotation system of HowNet. Some of the sememes we observe in the reference are very sparse. For instance, weatherfine is a sememe in HowNet, which we think can be split into other sememes like weather and begood.

Except for the wrong predictions (29%), we observe that the rest of the prediction result types are all similar to, or can substitute for, the standard sememes of the word. We think that for these predictions, the predicted sememes are able to represent most of the meaning of the word, which is helpful for downstream tasks. Actually, even part of the wrong predictions can be of help, which we explain in detail below.

In Figure 4, we further split the reasons for the Wrong predictions in Figure 3 into seven categories.

Literal: Among the reasons, a large part (Literal, 24.14%) is because the model is distracted by the literal meaning of some part of the description that is not the key information about the word. For example, for the word 磕 (knock),

our model predicts the sememes position and wholly, because the description contains expressions about position like 碰在硬东西上 (knocked on a hard thing), 人与人之间 (between people) and 使附着物掉下来 (make the attachment fall off). These expressions all concern the position of something, which misleads the model.

| Word | Reference | Prediction | Category |
| 历史唯物主义 (historical materialism) | 知识 (knowledge), 思想 (thinking), 物质 (physical), 主 (primary), 最 (most) | 知识 (knowledge), 思想 (thinking), 物质 (physical), 主 (primary), 最 (most) | Correct |
| 宦门 (official family) | 家庭 (family), 人 (human), 官 (official) | 家庭 (family), 官 (official) | Partial |
| 混纺 (blend fabric) | 人工物 (artifact), 衣物 (clothing), 用具 (tool) | 材料 (material), 衣物 (clothing), 用具 (tool) | Plausible |
| 国有化 (nationalize) | 变性态 (ize), 归属中央 (central) | 地方 (place), 有 (own), 国家 (country), 政 (politics) | Wrong |

Table 5: Examples of words and sememes. Reference indicates the standard sememes in HowNet, and Prediction indicates our predicted results. The categories of the examples correspond to Figure 3.

Close: 20.69% of the wrong predictions are actually close to the answers. 国有化 (nationalize), which we mentioned above, is an example of this type.

Polysemy: 17.24% of the wrong predictions are caused by polysemy: some words have multiple meanings, and the standard sememes refer to a different meaning from the one in the description. For example, 一如 can mean the title of a rank in karate or the same. The sememes refer to the meaning the same, while the description in the wiki is about karate. The mismatch between the description and the answer causes such problems.

Complex: 10.34% of the wrong predictions occur because the descriptions are too complex or long, and usually include many other meanings of the word. Because we only use a heuristic way to align the senses with the description, and the senses in the wiki descriptions are not clearly aligned, sometimes the sememes in the reference correspond to only a part of the description, which is not in the dominant position.
For example, the word 践履 can mean step on and fulfill. Step on is the original meaning of the word; however, the most common usage of this word now points to the meaning fulfill. In the description, a large part describes the meaning step on and gives instances of this meaning. This makes our model focus on the wrong part of the description, and thus make wrong predictions.

Pattern: 6.9% of the wrong predictions are caused by the pattern of the annotated answer, most of which involve the explanation of some rarely used Chinese characters. For example, the word 轲 means wooden vehicle, but this original meaning is rarely used now, and the word is better known as part of the name of the Chinese sage 孟轲 (Mencius), so the sememes in the reference are character and China.

Too Simple: 3.45% of the wrong predictions occur because the descriptions from the wiki are too simple. For example, the description of the word 猛子 is 扎猛子, which is just another way of expressing the word without much explanation.

Unable: We cannot tell why our model fails to predict the right answer for the rest of the wrong predictions (17.24%). In these cases, the descriptions are clear, but the predicted sememes are unrelated to the description.

To address the mistakes mentioned above, several possible methods can be applied. First, a more powerful word sense alignment step can be applied, so that the description and the sememes correspond to each other. Second, the annotation system can be modified so that the sememes become less sparse and overlap less. Third, the context of the words can be introduced to help distinguish between different senses.

5 Conclusion and Future Work

In this paper, we focus on the task of learning knowledge from unstructured textual descriptions in wiki pages. We choose to represent words and phrases with weakly ordered sememes. To predict the sememes of a word based on its descriptions, we propose to apply a seq2seq based model.
We observe that directly applying the seq2seq framework is problematic because of its strong assumption on the order between labels. To make the seq2seq model more suitable for multi-label tasks, we propose a novel soft loss function that turns the one-hot target label into a probability distribution. To make prediction more accurate, we also propose a multi-resource encoder that makes use of multiple wiki resources. Experiment results show that our label distributed seq2seq model works well on

the sememe prediction task. Its performance is even better than that of amateur human annotators on a randomly selected subset of the test set. We make a detailed error analysis and propose possible solutions.

In the future, we would like to explore how to better align the word senses with the articles in the wiki pages. It would also be interesting to take the more sophisticated structures of sememes into consideration.

References

Fernando Benites and Elena Sapozhnikova. 2015. HARAM: a hierarchical ARAM neural network for large-scale text classification. In Data Mining Workshop (ICDMW), 2015 IEEE International Conference on, pages 847–854. IEEE.

Leonard Bloomfield. 1926. A set of postulates for the science of language. Language, 2(3):153–164.

Matthew R Boutell, Jiebo Luo, Xipeng Shen, and Christopher M Brown. 2004. Learning multi-label scene classification. Pattern recognition, 37(9):1757–1771.

Amanda Clare and Ross D King. 2001. Knowledge discovery in multi-label phenotype data. In European Conference on Principles of Data Mining and Knowledge Discovery, pages 42–53. Springer.

Zhendong Dong and Qiang Dong. 2006. Hownet And The Computation Of Meaning (With Cd-rom). World Scientific.

Xiangyu Duan, Jun Zhao, and Bo Xu. 2007. Word sense disambiguation through sememe labeling. In IJCAI, pages 1594–1599.

Johannes Fürnkranz, Eyke Hüllermeier, Eneldo Loza Mencía, and Klaus Brinker. 2008. Multilabel classification via calibrated label ranking. Machine learning, 73(2):133–153.

Cliff Goddard and Anna Wierzbicka. 1994. Semantic and lexical universals: Theory and empirical findings, volume 25. John Benjamins Publishing.

Minlie Huang, Borui Ye, Yichen Wang, Haiqiang Chen, Junjun Cheng, and Xiaoyan Zhu. 2014. New word detection for sentiment analysis. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 531–541.

Peng Jin, Xu Sun, Yunfang Wu, and Shiwen Yu. 2007. Word clustering for collocation-based word sense disambiguation. In International Conference on Intelligent Text Processing and Computational Linguistics, pages 267–274. Springer.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Gakuto Kurata, Bing Xiang, and Bowen Zhou. 2016. Improved neural network-based multi-label classification with better initialization leveraging label co-occurrence. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 521–526.

Li Li, Houfeng Wang, Xu Sun, Baobao Chang, Shi Zhao, and Lei Sha. 2015. Multi-label text categorization with joint learning predictions-as-features method. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 835–839.

Wei Li, Yunfang Wu, and Xueqiang Lv. 2016. Improving word vector with prior knowledge in semantic dictionary. In Natural Language Understanding and Intelligent Applications, pages 461–469, Cham. Springer International Publishing.

Qun Liu and Sujian Li. 2002. Word similarity computing based on hownet. Computational linguistics and Chinese language processing, 7(2):59–76.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.

Jinseok Nam, Jungi Kim, Eneldo Loza Mencía, Iryna Gurevych, and Johannes Fürnkranz. 2014. Large-scale multi-label text classification revisiting neural networks. In Joint european conference on machine learning and knowledge discovery in databases, pages 437–452. Springer.

Jinseok Nam, Eneldo Loza Mencía, Hyunwoo J Kim, and Johannes Fürnkranz. 2017. Maximizing subset accuracy with recurrent neural networks in multi-label classification. In Advances in Neural Information Processing Systems, pages 5419–5429.

Yilin Niu, Ruobing Xie, Zhiyuan Liu, and Maosong Sun. 2017. Improved word representation learning with sememes. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 2049–2058.

James W Pennebaker, Martha E Francis, and Roger J Booth. 2001. Linguistic inquiry and word count: Liwc 2001. Mahway: Lawrence Erlbaum Associates, 71(2001):2001.

Jesse Read, Bernhard Pfahringer, Geoff Holmes, and Eibe Frank. 2011. Classifier chains for multi-label classification. Machine learning, 85(3):333.

Piotr Szymański, Tomasz Kajdanowicz, and Kristian Kersting. 2016. How is a data-driven approach better than random choice in label space division for multi-label classification? Entropy, 18(8):282.

Grigorios Tsoumakas, Ioannis Katakis, an Ioannis Vlahavas. 2011. Ranom k-labelsets for multilabel classification. IEEE Transactions on Knowlege an Data Engineering, 23(7):1079 1089. Grigorios Tsoumakas an Ioannis Vlahavas. 2007. Ranom k-labelsets: An ensemble metho for multilabel classification. In European Conference on Machine Learning, pages 406 417. Oriol Vinyals, Samy Bengio, an Manjunath Kulur. 2015. Orer matters: Sequence to sequence for sets. arxiv preprint arxiv:1511.06391. Ruobing Xie, Xingchi Yuan, Zhiyuan Liu, an Maosong Sun. 2017. Lexical sememe preiction via wor embeings an matrix factorization. In Proceeings of the 26th International Joint Conference on Artificial Intelligence, pages 4200 4206. AAAI Press. Pengcheng Yang, Xu Sun, Wei Li, Shuming Ma, Wei Wu, an Houfeng Wang. 2018. Sgm: Sequence generation moel for multi-label classification. arxiv preprint arxiv:1806.04822. Xiangkai Zeng, Cheng Yang, Cunchao Tu, Zhiyuan Liu, an Maosong Sun. 2018. Chinese liwc lexicon expansion via hierarchical classification of wor embeings with sememe attention. Min-Ling Zhang an Zhi-Hua Zhou. 2006. Multilabel neural networks with applications to functional genomics an text categorization. IEEE transactions on Knowlege an Data Engineering, 18(10):1338 1351. Min-Ling Zhang an Zhi-Hua Zhou. 2007. Ml-knn: A lazy learning approach to multi-label learning. Pattern recognition, 40(7):2038 2048.
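The idea of replacing a one-hot target with a probability distribution over the gold labels can be illustrated with a small sketch. This is not the exact loss defined in the paper: the function names, the weighting parameter `alpha`, and the choice to spread the remaining mass uniformly over the other gold sememes are all assumptions made for illustration only.

```python
import math


def soft_target(gold_labels, step, vocab_size, alpha=0.7):
    """Build a target distribution for one decoding step.

    Assumption (not from the paper): the label at the current step gets
    mass `alpha`, and the remaining 1 - alpha is shared uniformly by the
    other gold labels, so the decoder is not penalized as harshly for
    emitting a correct label in a different order.
    """
    target = [0.0] * vocab_size
    current = gold_labels[step]
    others = [label for label in set(gold_labels) if label != current]
    if not others:
        # Only one gold label: fall back to the usual one-hot target.
        target[current] = 1.0
        return target
    target[current] = alpha
    share = (1.0 - alpha) / len(others)
    for label in others:
        target[label] = share
    return target


def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def soft_cross_entropy(logits, target):
    """Cross entropy between a target distribution and softmax(logits)."""
    log_probs = log_softmax(logits)
    return -sum(t * lp for t, lp in zip(target, log_probs) if t > 0.0)


# Hypothetical example: 4 sememes in the label vocabulary, gold set {2, 0}.
step_target = soft_target([2, 0], step=0, vocab_size=4)
step_loss = soft_cross_entropy([0.5, 0.1, 1.2, -0.3], step_target)
```

With `alpha=1.0` the sketch reduces to the standard one-hot cross entropy, which makes the strict ordering assumption of plain seq2seq training explicit; lowering `alpha` relaxes it.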
