The Research of Word Sense Disambiguation Method Based on Co-occurrence Frequency of Hownet


Erhong Yang, Guoqing Zhang, and Yongkui Zhang
Dept. of Computer Science, Shanxi University, TaiYuan 030006, P. R. China
Email: zyk@sxu.edu.cn

Abstract
Word sense disambiguation (WSD) is a difficult problem in natural language processing. This paper introduces a WSD method based on the co-occurrence frequency of sememes. The method uses Hownet as its information source: a co-occurrence frequency database of sememes is constructed and then used for WSD. The experimental results show that the method is successful.

Keywords: word sense disambiguation, Hownet, sememe, co-occurrence

1. Introduction
Word sense disambiguation (WSD) is one of the most difficult problems in NLP. It is helpful, and in some instances required, for applications such as machine translation, information retrieval, content and thematic analysis, and hypertext navigation. The problem of WSD was first put forward in 1949. In the following decades, researchers adopted many methods for automatic word sense disambiguation, including 1) AI-based methods, 2) knowledge-based methods, and 3) corpus-based methods [1]. Although some useful results have been obtained, the problem of word sense disambiguation is far from solved. The difficulties of WSD are as follows [2]:
1) Evaluation of word sense disambiguation systems is not yet standardized.
2) The potential for WSD varies by task.
3) Adequately large sense-tagged data sets are difficult to obtain.
4) The field has narrowed down approaches, but only a little.

In this paper, we use a statistical method to address automatic word sense disambiguation [3]. The method uses a new knowledge base, Hownet [4,5], as its knowledge resource. Instead of words, the sememes defined in Hownet are used to gather the statistics. Doing so alleviates the problem of data sparseness to a large degree.
2. A Brief Introduction Of Hownet
Hownet is a knowledge base that was recently released on the Internet. It describes the concepts represented by Chinese and English words and reveals the relations between concepts and the attributes of concepts. In this paper, we use the Chinese knowledge base, an important part of Hownet, as the resource for disambiguation. The format of this file is as follows:

W_X = the word
E_X = some examples of the word
G_X = the part of speech of the word
DEF = the definition of the word

This research project is supported by a grant from Shanxi Natural Science Foundation of China.

An important concept in Hownet that we must introduce is the sememe. In Hownet, sememes are the basic units of sense. They are used to describe all the entries in Hownet, and there are more than 1,500 sememes altogether.

3. Sense Co-occurrence Frequency Database
It is well known that some words tend to co-occur more frequently with certain words than with others [6]. Similarly, some word meanings tend to co-occur more often with certain word meanings than with others. If we can capture the relations between word meanings quantitatively, this will help word sense disambiguation. In Hownet, all words are defined with a limited set of sememes, and the combination of sememes is fixed. If we collect statistics on the co-occurrence frequency of sememes so as to reflect the co-occurrence of words, the problem of data sparseness is alleviated to a large degree. Based on this idea, we built a sense co-occurrence frequency database to disambiguate word senses.

3.1 The Preprocessing Of Hownet
The Hownet we downloaded from the Internet is in plain text form. This is not convenient for a computer to use, so it must be converted into a database. In the database, each lexical entry is converted into a record. The formal description of a record is as follows:

<lexical entry> ::= <NO.><morphology><part-of-speech><definition>

where NO. is the number of the lexical entry in Hownet, and the definition is composed of several sememes (SU for short) separated by commas. In addition, we deleted the English sememes in order to save space and speed up processing. Here are some examples after preprocessing:

NO.     Morphology   Part-of-speech   Definition
21424   俭朴         ADJ              属性值, 举止, 俭, 良
18888   坏           ADJ              属性值, 好坏, 坏, 莠
18889   坏           V                损害
18887   坏           V                坏掉
18890   坏           N                念头, 恶

3.2 The Creation Of Sememe Co-occurrence Frequency Database
The sememe co-occurrence frequency database is the basis of sense disambiguation. We now introduce it briefly.
The sememe co-occurrence frequency database is a two-dimensional table. Each cell holds the co-occurrence frequency of a pair of sememes. Before introducing the database, we give the following definition.

Definition: Suppose word W has m sense items in Hownet, and the definitions of these sense items are $y_{11}, y_{12}, \ldots, y_{1 n_1}$; $y_{21}, y_{22}, \ldots, y_{2 n_2}$; ...; $y_{m1}, y_{m2}, \ldots, y_{m n_m}$ respectively. We call $\{y_{i1}, y_{i2}, \ldots, y_{i n_i}\}$ a sememe set of W (SS for short), and call $\{\{y_{11}, \ldots, y_{1 n_1}\}, \{y_{21}, \ldots, y_{2 n_2}\}, \ldots, \{y_{m1}, \ldots, y_{m n_m}\}\}$ the sememe expansion of W (SE for short).

For example, in the records shown above, the word 俭朴 has only one sense item. The sememe set of this sense item is { 属性值, 举止, 俭, 良 }, and the sememe expansion of 俭朴 is {{ 属性值, 举止, 俭, 良 }}. The word 坏 has four sense items, whose sememe sets are { 属性值, 好坏, 坏, 莠 }, { 损害 }, { 坏掉 } and { 念头, 恶 } respectively. The sememe expansion of 坏 is {{ 属性值, 好坏, 坏, 莠 }, { 损害 }, { 坏掉 }, { 念头, 恶 }}.

When building the sememe co-occurrence frequency database, the corpus is first segmented and each word is tagged with its sememe expansion in Hownet. Then, for each unique pair of words that co-occur in a sentence (here a sentence is a string of characters delimited by punctuation), the co-occurrence data of the sememes belonging to the definitions of the two words are collected. When collecting co-occurrence data, we adopt the principle that every pair of words co-occurring in a sentence should contribute equally to the sememe co-occurrence data, regardless of the number of sense items of each word and the length of each definition. Moreover, the contribution of a word should be evenly distributed among all its senses, and the contribution of a sense should be evenly distributed among all the sememes in that sense. The algorithm is as follows:

1. Initialize each cell in the sememe co-occurrence frequency database (SCFD for short) to 0.
2. For each sentence S in the training corpus, do steps 3-7.
3. For each word in sentence S, tag it with its sememe expansion.
4. For each unique pair of sememe expansions $(SE_i, SE_j)$, do steps 5-7.
5. For each sememe $SU_{imp}$ in each sememe set $SS_{im}$ in $SE_i$, do steps 6-7.
6. For each sememe $SU_{jnq}$ in each sememe set $SS_{jn}$ in $SE_j$, do step 7.
7. Increase the values of cells $SCFD(SU_{imp}, SU_{jnq})$ and $SCFD(SU_{jnq}, SU_{imp})$ by the product of $w(SU_{imp})$ and $w(SU_{jnq})$,

where $w(SU_{xyz})$ is the weight of $SU_{xyz}$, given by

$$ w(SU_{xyz}) = \frac{1}{|SE_x| \cdot |SS_{xy}|} $$

Here $|SE_x|$ is the number of sememe sets in $SE_x$ and $|SS_{xy}|$ is the number of sememes in $SS_{xy}$, in line with the even-distribution principle stated above.

It follows from the algorithm that the SCFD is symmetric. To save space and speed up processing, we store only those cells $(SU_i, SU_j)$ that satisfy $SU_i \le SU_j$.
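The accumulation steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the toy lexicon holds only the two words from the example records, the function names are hypothetical, words missing from the lexicon are assumed to be skipped, and a word repeated in a sentence is assumed to count once. For readability the sketch stores the full symmetric table instead of only the cells with $SU_i \le SU_j$.

```python
from collections import defaultdict
from itertools import combinations

# Toy lexicon (word -> sememe expansion), copied from the records in Section 3.1.
LEXICON = {
    "俭朴": [["属性值", "举止", "俭", "良"]],
    "坏": [["属性值", "好坏", "坏", "莠"], ["损害"], ["坏掉"], ["念头", "恶"]],
}


def sememe_weights(expansion):
    """Return (sememe, weight) pairs with weight = 1 / (|SE| * |SS|)."""
    pairs = []
    for sememe_set in expansion:
        weight = 1.0 / (len(expansion) * len(sememe_set))
        pairs.extend((sememe, weight) for sememe in sememe_set)
    return pairs


def build_scfd(sentences, lexicon):
    """Accumulate sememe co-occurrence weights over segmented sentences."""
    scfd = defaultdict(float)  # step 1: every cell starts at 0
    for words in sentences:  # step 2
        # Step 3, assuming unknown words are skipped and repeats count once.
        tagged = sorted({w for w in words if w in lexicon})
        for word_i, word_j in combinations(tagged, 2):  # step 4
            for su_i, w_i in sememe_weights(lexicon[word_i]):  # step 5
                for su_j, w_j in sememe_weights(lexicon[word_j]):  # step 6
                    # Step 7: update both symmetric cells.
                    scfd[(su_i, su_j)] += w_i * w_j
                    scfd[(su_j, su_i)] += w_i * w_j
    return scfd


scfd = build_scfd([["俭朴", "坏"]], LEXICON)
print(scfd[("俭", "损害")])  # 1/4 * 1/4 = 0.0625
print(sum(scfd.values()))    # 2.0: one word pair, counted in both directions
```

Because the weights of each word sum to 1, every co-occurring word pair adds exactly 2 to the table total, which is the property the normalization in the scoring section relies on.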
3.3 The Sememe Co-occurrence Frequency Database Based Disambiguation Method

3.3.1 The Sememe Co-occurrence Frequency Based Scoring Method
When disambiguating a polysemous word, we use the following equation to score a sense item of the word against the context containing it. The context of a word is the sentence containing the word.

$$ \mathrm{score}(S, C) = \mathrm{score}(SS, C') - \mathrm{score}(SS, GlobalSS) \tag{1} $$

where S is a sense item of the polysemous word W, C is the context containing W, SS is the sememe set of S, C' is the set of sememe expansions of the words in C, and GlobalSS is the sememe set containing all the sememes defined in Hownet.

$$ \mathrm{score}(SS, C') = \frac{1}{|C'|} \sum_{SE \in C'} \mathrm{score}(SS, SE) \tag{2} $$

for any sememe set SS and sememe expansion set C'.

$$ \mathrm{score}(SS, SE) = \max_{SS' \in SE} \mathrm{score}(SS, SS') \tag{3} $$

for any sememe set SS and sememe expansion SE.

$$ \mathrm{score}(SS, SS') = \frac{1}{|SS|} \sum_{SU \in SS} \mathrm{score}(SU, SS') \tag{4} $$

for any sememe sets SS and SS'.

$$ \mathrm{score}(SU, SS') = \frac{1}{|SS'|} \sum_{SU' \in SS'} \mathrm{score}(SU, SU') \tag{5} $$

for any sememe SU and sememe set SS'.

$$ \mathrm{score}(SU, SU') = I(SU, SU') \tag{6} $$

for any sememes SU and SU'.

$$ I(SU, SU') = \log_2 \frac{f(SU, SU') \cdot N^2}{g(SU) \cdot g(SU')} \tag{7} $$

where f(SU, SU') is the co-occurrence frequency of the sememe pair (SU, SU') in the SCFD. For g(SU) and N we have:

$$ g(SU) = \sum_{SU'} f(SU, SU') \tag{8} $$

$$ N = \frac{1}{2} \sum_{SU, SU'} f(SU, SU') \tag{9} $$

In equation (7), the mutual-information-like measure deviates from the standard mutual information measure by an extra multiplicative factor N. The corpus is not large enough, and the mutual information of some sememe pairs would be negative if it were not normalized by this extra factor. In equation (9), the sum of f(SU, SU') is divided by 2 because each co-occurrence of a sememe pair is recorded twice, once in SCFD(SU, SU') and once in SCFD(SU', SU).

When disambiguating, we tag the polysemous word W with the sense item T that satisfies the following equation:

$$ T = \arg\max_{S} \mathrm{score}(S, C) \tag{10} $$

3.3.2 The Creation Of Mutual Information Database
We created a mutual information database according to equations (7), (8) and (9). Some examples follow. The examples in Table 1 have high mutual information, and the sememe pairs in this table have clear semantic relations. The examples in Table 2 have low mutual information, and the sememe pairs in this table have no obvious semantic relations.

Table 1: Examples of sememe pairs with high mutual information

Sememe 1   Sememe 2   Mutual information   Sememe 1   Sememe 2   Mutual information
赌博       寻欢       33.811057            表情       羞愧       27.418417
鼓吹       夸大       29.441937            昏迷       醒         27.234630
光洁度     摸         28.024560            味道       香         27.093292
跑         气喘       28.023521            慢待       漠         26.984521
使净       整理       27.571478            低植       蔬菜       26.710478

Table 2: Examples of sememe pairs with low mutual information

Sememe 1   Sememe 2   Mutual information   Sememe 1   Sememe 2   Mutual information
食品       政         8.693242             合作       末         9.171023
交往       医         8.754611             侧         液         9.357734
车         圆         8.793914             驱赶       正误       9.448947
合作       疾病       9.121846             程度       交换       9.528801
机构       疾病       9.150412             禽         主次       9.599495

It can be concluded from Table 1 and Table 2 that the mutual information can reflect
the tightness of semantic relations.

4. Experiment And Analysis
We ran the experiment on a corpus of 10,000 characters from People's Daily. First, the corpus was segmented, and then the sememe co-occurrence frequency database and the mutual information database were created. The mutual information database contains 709,496 data items corresponding to different sememe pairs. To speed up processing, the mutual information database was sorted and indexed by the first two bytes of each sememe pair. Finally, we ran the disambiguation experiment on some polysemous words. Here are two examples:

Example 1: 全 省 两万四千 多 名 党政 干部 累计 处理 信访 案 十万 余 件
(More than 24,000 Party and government cadres across the province have handled over 100,000 petition cases in total.)
Example 2: 这 是 香港 海关 今年 破获 的 第 一 宗 来自 内地 的 文物 走私 案
(This is the first case of cultural relic smuggling from the mainland that Hong Kong Customs has cracked this year.)

We use the following equation to assess the accuracy ratio of disambiguation:

$$ \text{accuracy ratio} = \frac{\text{the number of correctly tagged examples}}{\text{the total number of examples in the testing set}} \tag{11} $$

The experimental results are shown in Table 4.

Table 3: Two examples disambiguated using the sememe co-occurrence frequency database

Definition of word 案               Score of sense item and context    Score of sense item and context
                                    of word 案 in Example 1            of word 案 in Example 2
文书                                14.459068                          8.659968
事情                                9.817648                           10.817648
事情警                              7.415986                           12.415986
家具放置                            -0.134779                          -0.134779
语文提出商讨辩论                    -0.818518                          -0.818518
Maximum co-occurrence frequency     14.459068                          12.415986
Disambiguation result               文书                               事情警

Table 4: The experimental results

             Total number of testing examples   Number of correctly tagged examples   Accuracy ratio
Close test   100                                75                                    75%
Open test    100                                71                                    71%

The disambiguation method introduced above has the following characteristics:
(1) The problem of data sparseness is solved to a large degree.
(2) The method avoids the laborious hand tagging of a training corpus.
(3) The method can easily be applied to other kinds of corpora.

References
[1] Nancy Ide, Jean Veronis, Introduction to the Special Issue on Word Sense Disambiguation: The State of the Art, Computational Linguistics, 1998, Volume 24, Number 1, pp. 1-40.
[2] Philip Resnik, David Yarowsky, A Perspective on Word Sense Disambiguation Methods and their Evaluation, http://www.cs.jhu.edu/~yarowsky/pubs.html
[3] Alpha K. Luk, Statistical Sense Disambiguation with Relatively Small Corpus Using Dictionary Definitions, 33rd Annual Meeting of the Association for Computational Linguistics, 26-30 June 1995, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA, pp. 181-188.
[4] Dong Zhendong, The Expression of Semantic Relations and the Construction of Knowledge Systems (in Chinese), Applied Linguistics (语言文字应用), 1998, No. 3 (issue 27 overall), pp. 76-82.
[5] Dong Zhendong, Hownet (知网), http://www.how-net.com.
[6] Kenneth Ward Church, Word Association Norms, Mutual Information, and Lexicography, Computational Linguistics, 1990, Volume 16, Number 1, pp. 22-29.
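As a supplementary illustration of the scoring method in Section 3.3.1, the following Python sketch computes the mutual-information-like measure of equations (7)-(9) and applies equations (1)-(6) and (10) to pick a sense. It is a sketch under stated assumptions, not the authors' code: equation (1) is read as a difference, equations (2), (4) and (5) are read as averages, sememe pairs never seen in training are assumed to score 0, and all function names and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def mutual_information(scfd):
    """Equations (7)-(9) over a full symmetric table {(SU, SU'): f}."""
    g = defaultdict(float)
    for (su, _), f in scfd.items():
        g[su] += f  # equation (8): row total of each sememe
    n = sum(scfd.values()) / 2.0  # equation (9)
    return {
        (su, su2): math.log2(f * n * n / (g[su] * g[su2]))  # equation (7)
        for (su, su2), f in scfd.items()
        if f > 0
    }


def score_su_ss(mi, su, ss2):
    """Equation (5), read as an average over the sememes of SS'."""
    # Assumption: pairs never seen in training contribute 0.
    return sum(mi.get((su, su2), 0.0) for su2 in ss2) / len(ss2)


def score_ss_ss(mi, ss, ss2):
    """Equation (4): average over the sememes of SS."""
    return sum(score_su_ss(mi, su, ss2) for su in ss) / len(ss)


def score_ss_se(mi, ss, se):
    """Equation (3): best-matching sememe set of the expansion SE."""
    return max(score_ss_ss(mi, ss, ss2) for ss2 in se)


def score_ss_context(mi, ss, context):
    """Equation (2): average over the sememe expansions in C'."""
    return sum(score_ss_se(mi, ss, se) for se in context) / len(context)


def disambiguate(mi, senses, context, global_ss):
    """Equations (1) and (10): return the sememe set with the highest score."""
    def score(ss):
        return score_ss_context(mi, ss, context) - score_ss_ss(mi, ss, global_ss)
    return max(senses, key=score)


# Hypothetical symmetric frequency table over four placeholder sememes.
toy_scfd = {
    ("a", "b"): 2.0, ("b", "a"): 2.0,
    ("a", "c"): 1.0, ("c", "a"): 1.0,
    ("c", "d"): 1.0, ("d", "c"): 1.0,
}
mi = mutual_information(toy_scfd)
senses = [["b"], ["c"]]   # two candidate sense items of the target word
context = [[["a"]]]       # one context word whose expansion is {{a}}
global_ss = ["a", "b", "c", "d"]
print(disambiguate(mi, senses, context, global_ss))  # ['b']
```

In the toy table the sememe b co-occurs with a more often than c does, so the sense whose sememe set is {b} wins even after the GlobalSS term of equation (1) is subtracted.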