Hindi Word Sense Disambiguation


Manish Sinha, Mahesh Kumar Reddy R., Pushpak Bhattacharyya, Prabhakar Pandey, Laxmi Kashyap
Department of Computer Science and Engineering, Indian Institute of Technology Bombay, Mumbai, India
{manish, mahesh, pb, pandey, yupu}@cse.iitb.ac.in

Abstract

Word Sense Disambiguation (WSD) is the task of finding the correct sense of a word in a specific context. It is crucial for applications such as Machine Translation and Information Extraction. While the work on automatic WSD for English is voluminous, this is, to our knowledge, the first attempt at automatic WSD for an Indian language. We make use of the Wordnet for Hindi developed at IIT Bombay, an important lexical knowledge base for Hindi. The main idea is to compare the context of the word in a sentence with the contexts constructed from the Wordnet and to choose the winner. The output of the system is a synset number designating the sense of the word. The Wordnet contexts are built from the semantic relations and glosses, using the Application Programming Interface created around the lexical data. The evaluation has been done on the Hindi corpora provided by the Central Institute of Indian Languages, and the results are encouraging. Currently the system disambiguates nouns; work on other parts of speech is in progress.

Keywords: Hindi Wordnet, Text Similarity Measures, Semantic Relations in the Wordnet, Intersection Similarity

1. Introduction

Word Sense Disambiguation (WSD) is the task of finding the correct sense of a word in a context. The task needs large amounts of word and world knowledge. Hindi Wordnet [8] is an important lexical resource developed at IIT Bombay, India. Let us consider the word स ब ध (संबंध, "relation") in the following Hindi sentence:

ऋ व द क एक ऋच म द य क वश षण स उनक स क त एव व दक सम ज क स थ उनक स ब ध पर प ण क श पड़त ह उ ह अबत, म व च, अ, प ण, अय आ द कह गय ह

Figure 1.1: One of the possible usages of स ब ध

From the Hindi Wordnet, we find that there are 6 senses of स ब ध, viz.,

1. स ब ध, स ब ध, मतलब, न त, त ल क़, व त, र त - कस क र क लग व य स पक :"इस क म स र म क क ई स ब ध नह ह " [an attachment or contact of some kind]
2. स ब ध क रक, ष, स ब ध, स ब ध, स ब ध क रक - य करण म वह क रक जसस एक श द क दसर श द क स थ स ब ध स चत ह त ह :"स ब ध क रक क वभ क, क, क, र, र र आ द ह ज स यह कस क प तक ह?" [in grammar, the case that marks the relation of one word to another]
3. लग व, स ब ध, स ब ध, स सग - द व त ओ म कस क र क लग व य स पक बतल न व ल त व:"स थ रहत -रहत त ज नव र स भ लग व ह ज त ह " [that which indicates an attachment or contact between two things]
4. स ब ध, स ब ध, र त - वव ह अथव उसक न य:"म गल क लए बल सप र म स ब ध प क ह गय ह " [a marriage or its settlement]
5. स ब ध, स ब ध - एक स थ ब धन, ज ड़न य मलन क बय :" म-भ व स आपस स ब ध म ग ढ़त आत ह " [the act of being tied, joined or brought together]
6. न त, र त, स ब ध, स ब ध - मन य क वह प र प रक स ब ध ज एक ह क ल म ज म ल न अथव वव ह आ द करन स ह त ह :"मध रम स आपक य न त ह?" [kinship arising from birth in the same family or from marriage]

Figure 1.2: Senses of स ब ध obtained from the Wordnet

In this particular case, sense 1 is the most appropriate one, though senses 5 and 6 are relevant too.

1.1 Related Work for English

Yarowsky proposed a solution to WSD using a thesaurus and a supervised learning approach [3]. Word
associations are recorded and, for an unseen text, the senses of the words are detected from the learnt associations. Agirre and Rigau use a measure based on the proximity of the text words in Wordnet (Conceptual Density) to disambiguate the words [4]. The idea that translation presupposes WSD is due to Nancy Ide, who disambiguates words using bilingual corpora [2]. The design of the well-known WASP workbench for sense disambiguation is described by Kilgarriff [5]. Lin [16] and Lesk [17] have studied theoretical definitions of similarity and provided word similarity measures which are hypernymy based and gloss based, respectively.

2. Wordnet Principle

Wordnet is an online lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory [1]. Each word meaning can be represented by a set of word-forms known as a synonym set or synset. Synsets are created for content words, i.e., for nouns, verbs, adjectives and adverbs.

2.1 Lexical Matrix

The following table, called the Lexical Matrix, is an abstract representation of the organization of lexical information. Word forms are imagined to be listed as headings for the columns and word meanings as headings for the rows. Rows express synonymy while columns express polysemy.

Word Meanings    Word-Forms
                 F1      F2      F3      ...     Fn
M1               E1,1    E1,2
M2                       E2,2
M3                               E3,3
...
Mm                                               Em,n

Table 2.1: Illustrating the concept of Lexical Matrix

For example, the synset {कलम, प न, क़लम, ल खन } gives the meaning उपकरण जसक सह यत स क गज़ आ द पर लखत ह [an instrument with which one writes on paper etc.]. कलम belongs to a synset whose members form a row in the lexical matrix, and the row number gives a unique id to the synset. कलम has another meaning, प ड़ क वह टहन ज द सर जगह ब ठ न य दसर प ड़ म प ब द लग न क लए क ट ज ए [a twig of a tree cut to be planted elsewhere or grafted onto another tree], which comes in the column headed by this word.

2.2 Semantic Relations in Wordnet

The design of the Hindi Wordnet is inspired by the famous English Wordnet [1]. The basic semantic relations are as follows:

Relation                 Meaning
Hypernymy/Hyponymy       Is-A (Kind-Of)
Entailment/Troponymy     Manner-Of (for verbs)
Meronymy/Holonymy        Has-A (Part-Whole)

Table 2.2: Illustrating the nature of the relations in Wordnet

For instance, we have the synset {घर, ग ह}. Its hypernymy relation (Is-A) links to {आव स, नव स}, its meronymy relation (Has-A) links to {आ गन}, {बर मद } and {अ ययन क }, and its hyponymy relation links to {ब ड़ }, {सर य} and {झ पड़ }.

Figure 2.1: A small part of the Hindi Wordnet

2.3 Wordnet Application Programming Interface

The WSD task needs various kinds of information from the Wordnet, which in turn calls for an Application Programming Interface to the Wordnet. Figure 2.2 shows the organization of the API. To take a particular example, the findtheinfo() function receives as input arguments the word form, the syntactic category, the search type (e.g., hypernymy) and the sense number. It returns the output for that search type (here, the hypernyms) in a text buffer. The API functions are meant for the following: (1) Morphological Processing (2) Database Searching (3) Utilities. Morphological processing routines extract the
stem from the word. Database search functions are used to retrieve information from the Wordnet. Utilities support the other operations needed to process words.

Figure 2.2: Layers of Application Programming Interface around the Wordnet (functions shown around the Hindi data: findtheinfo, getindex, in_wn, index_lookup, read_synset, free_synset, free_index, morphstr, parse_synset)

3. Methodology: Our Approach to WSD

We describe a statistical technique for assigning senses to words in Hindi. A word is assigned a sense using (i) the context in which it has been mentioned, (ii) the information in the Hindi Wordnet and (iii) the overlap between these two pieces of information. The sense with the maximum overlap is the winner sense.

WSD Algorithm: Finding the word's sense

1. For a polysemous word w needing disambiguation, a set of context words in its surrounding window is collected. Let this collection be C, the context bag.
2. For each sense s of w, do the following:
   (a) Let B be the bag of words obtained from the
       (I) Synonyms
       (II) Glosses
       (III) Example Sentences
       (IV) Hypernyms
       (V) Glosses of Hypernyms
       (VI) Example Sentences of Hypernyms
       (VII) Hyponyms
       (VIII) Glosses of Hyponyms
       (IX) Example Sentences of Hyponyms
       (X) Meronyms
       (XI) Glosses of Meronyms
       (XII) Example Sentences of Meronyms
   (b) Measure the overlap between C and B using the intersection similarity measure.
3. Output the sense s which has the maximum overlap as the most probable sense.

Figure 3.1 gives a pictorial description of the basic idea of the strategy. The idea behind using the intersection similarity measure is to capture the belief that there will be high overlap between the words in the context and the related words found from the Wordnet lexical and semantic relations and glosses.

Figure 3.1: Extracting semantic relations from Wordnet and building context from the text for WSD (labels: Text Document, Context, Hindi Wordnet, Semantic Relations, WSD Algorithm)

4. Components in the System

4.1 Parameters

Wordnet relations: We have used the hypernymy, hyponymy and meronymy relations. Since these relations are semantic in nature, we obtain the synsets they link to, together with their glosses and example sentences. We call the collection of words obtained from the Wordnet the Semantic Bag.

Word Context Size: The current sentence in which w occurs forms the most important context. We add to this the previous and the following sentences too. We call the collection of context words the Context Bag.

4.2 Implementation Modules

BuildContext: This module builds the context bag from the input document.
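As a minimal sketch of the procedure in Section 3 with the three-sentence window of Section 4.1, the following Python fragment builds a context bag and ranks senses by overlap. The function names, the toy sentences and the two semantic bags are illustrative assumptions, not the system's code or actual Wordnet output. The intersection similarity is taken here to be the number of distinct shared tokens, and the target word is left out of the context bag; the text specifies neither choice.

```python
PUNCT = ".,;:!?\"'()\u0964"  # \u0964 is the Devanagari danda


def tokenize(text):
    """Unique tokens of a text: punctuation is blanked, then split on whitespace."""
    table = str.maketrans({ch: " " for ch in PUNCT})
    return set(text.translate(table).split())


def build_context_bag(sentences, index, target):
    """Context bag C: tokens of the current, previous and following sentences."""
    window = sentences[max(0, index - 1): index + 2]
    bag = set()
    for sentence in window:
        bag |= tokenize(sentence)
    bag.discard(target)  # assumption: the target word itself is not counted
    return bag


def rank_senses(context_bag, semantic_bags):
    """Rank senses by the overlap |C & B|, largest first."""
    scores = {sense: len(context_bag & bag) for sense, bag in semantic_bags.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy input (hypothetical): three sentences, target word in the middle one.
sentences = ["राम बाजार गया।", "उसने कलम से कागज पर पत्र लिखा।", "पत्र लंबा था।"]

# Hypothetical semantic bags B for two senses of the target word.
semantic_bags = {
    "pen": {"कलम", "लेखनी", "कागज", "उपकरण"},
    "graft": {"कलम", "टहनी", "पौधा", "बगीचा"},
}

context = build_context_bag(sentences, 1, "कलम")
ranking = rank_senses(context, semantic_bags)
print(ranking)
```

With these toy bags the writing-instrument sense is ranked first, because one of its bag words also occurs in the context window.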

NounSemanticExtractor: This module builds the semantic bag by exploiting the semantic relations in the Wordnet. Its input is the polysemous word.

Tokenizer: This module finds the unique tokens in the input document. It is an intermediate module required by BuildContext and NounSemanticExtractor.

Intersection: This module computes the overlap between the two input bags.

Rank: This module ranks the senses according to the amount of intersection.

The details of the functions used for the WSD task are given in Appendix-I.

5. Evaluation

We use the Hindi corpora from the Central Institute of Indian Languages (CIIL), Mysore, as the test bed for sense disambiguation. Currently we do this task for nouns only.

5.1 Test Document

The following is part of a test document. The domain is sociology.

आय ब हर स आय य इस द श क म ल नव स रह ह इस पर पय स मम प वप म त त क ज त ह ल कन इस म ल क ह भ ल दय ज त ह क आय न म क क ई ज त य ज त भ रह ह य नह आय क जब ऋ व द और उसक परव स ह य म कह ज त अथ म य ग ह ह नह त यह वव द कह स श ह त ह इस पर हमन इस श ध प ऽक क वगत अ क म वच र कय ह व त त आय श द ग ण परक ह और अथ म इसक य ग ह त ह त य गत च रऽ क स थ ह यव थ दश न आचरण स प स त आ द क य पक स दभ म ह न स इसक स थ ज त व षण क म उस समय ह न व भ वक ह जब हम स यव दय क इस मन व क म यम बन ज त ह क भ रतवष म एक र ह न क मत ह नह ह सभ वण क उ प न करन क मत व ल इस द श क स ब ध म भ गभ श य क व वग भ इस प म वच र त त कर रह ह क भ रतवष म ह आ द स हई और यह स व म म नस वक स हआ ऋ व द स ल कर म नवधम श तक ज त व षण क ज त य मलत ह व इस क समथ न म न व व द प स ह व दक स ह य म ज वच र यवह र और स तगत व भ नत ए आत ह और उनम स घष क भ थ त चलत रहत ह उसपर आय अन य ज त भ द करन क अप हम व दक अव दक भ द करन अ धक सम च न त त ह त ह वण श द क य ग र ग क अथ म ऋ व द म पय ह त ह क त उसक य ग स म जक सम ह क अथ म भ हआ ह क ण और श ल वण क पर पर स घष म यह भ य न द न आव यक ह क ऐस स घष ज तगत नह ह जस आय वड़ ज त क प म वभ करत हए उ र द ण क स घष क प दय ज य ज द न क च क गय ह इनक स ब ध आचरणगत भ ह एक ह प रव र म श ल और क ण वण क भ ह न क उ ल ख ह इस आध र पर यह तक त त करन क वड़ क द वत शव और आय क द वत व ण म सम वय क बय 
रह ह ह य पद लगत ह य क व ण क ण वण क ह और शव कप र वण क इसम त व ण क श ल और शव क क ण ह न च हए थ यह द वम डल क वभ जन करन व ल वण क थ न पर शव और व ण क य वह रक व प क आध र बन द त ह इस य वह रक आध र पर य द प र व षण द ख ज य त आय ज त न ह कर आचरणगत त क त क ह ग ऋ व द म रण स प ह क वण श द आचरण एव यव त गत त स वश ष प म ज ड़ हआ ह और उसक य ग ज त क अथ म भ हआ ह एक थ न पर ण क द य और श क अस र कह गय ह यह अस र श द क य ग म य पक प स ग ण और स म जक सम ह द न अथ क य ग प य ज त ह ऋ व द क ऋच क त र य ण क स थ ज ड़ कर द खन पर प ह त ह क अस र श द क य ग ज त क अथ म हआ ह ज श क आचरण क य न म रखकर कय गय ह इसस यह न कष नक लन क खतरन क य स हआ ह क श क लए अस र श द क य ग स प ह क यह ज त आय अन य क द स व क प म व हई ह व त त अस र श द क य ग व ण और इ क लए भ हआ ह अस र श द क क ई ढ़ अथ ल न ह म मक ह द स द य और आय भ द क त ह नह ह कह कह उ ल ख आत ह जह व दक और द य ओ म भ द करन क लए इ स थ न क गय ह य द क ण वण क आध र पर द य भ न ह त त इस क र क थ न य क ज त इतन अव य ह क द य ओ क वर ध स व दक क र क लए ब रब र थ न क गय ह इन उ रण स प ह त ह क द य ओ म श र रक नह आच रगत भ नत रह ह व दक क ब ह मत और द य ओ क अय कह गय ह द य और द स कह भ न और कह एक ह म न गय ह एक ह ऋच म द न क न म आय ह और भ न अथ क तन करत ह द न क सम ह अव दक आचरण क प म प य ज त ह व दक क वपर त स घष म द न मऽ भ ह ज त ह फलत व दक क लए द न सम न प स शऽ ह गय और उनक वन श क लए सम न प स इ स थ न क गय ह द न क लए आय वश षण स प ह क उनक व दक वर ध क क रण व च रक स तक और यव थ गत ह द न क अबत कह गय ह जसक त पय ह क द न उस वच रध र क रह ह ज य धम क वर ध करत ह ऋ व द क एक ऋच म द य क वश षण स उनक स क त एव व दक सम ज क स थ उनक स ब ध पर प ण क श पड़त ह उ ह अबत म व च अ प ण अय आ द कह गय ह इसम प ण श द धन स चय क लए आय ह अथ त ज व दक व ध क अन स र धन क वतरण नह करत ह उस भ द य कह गय ह अन क थ न पर द स क प ण कह गय ह यह ज भ वश षण स मन आय ह उनस अ य त प ह क स र स घष यव थ गत ह इसम वण य ज त क मह वनह ह त उ वश षण स प ह क व दक यव त क म ल ध र य य ज वन क द य और द स वर ध करत ह और व य व उस व क र नह करत व व दक भ ष क अल व अ य भ ष क य ग करत ह व व दक यव थ म व न नह ह इ ह आध र पर इ ह अव दक कह गय ह इ इन पर वजय करत ह व दक क ल स ह र य क उ य व दक यव थ क र करन रह 
ह ब रब र इ क थ न क गय ह क द स स आय यव त क र क लए वह य कर म ल स घष आय यव थ क सव ग ण स र ण रह ह द स क स थ इस क र क स घष क वप ल म ण प य

ज त ह द स क स थ व दक क स घष म अन क क रण म स प म य ह द स क ध नन कह गय ह और इस स दभ म प ण श द क भ य ग ह त ह द स स प क स चय करत ह

The results obtained from this particular document are shown in Table 5.1.

Table 5.1: Results obtained from the test document (columns: Word, Synset, Comment)

व ध ढ ग, र त, तर क़, श ल, र त, ढर, व ध, प त, तर क, त र, अ द ज़, अ द ज, क य व ध, क़ यद, क य श ल ग ण स ण, अ छ ई, ग ण, ख़ ब, ख ब Partially correct र य द श, र य, त, त, स ब Incorrect प दल, प ट, प स ब ध स ब ध, स ब ध, मतलब, न त, त ल क़, त ल क, व त, र त, सर क र वण अ र, वण, आखर, हरफ़, हफ़ Incorrect अ क अ क, न य क, न टक अ क Incorrect भ द अ तर, असम नत, फक़, भ नत, भ द, अ श व च य तर क ख ड,अ श,टकड़,भ ग, ह स,अ ग, वभ ग,कल,चरण

कल, फ़न, हनर, व Partially य, य स, उ म, उ ग, य, क शश, च, प रव न म न म वच र वच र, ख़य ल, म त य Partially र ग र ग धन म लधन, प ज, असल, म ल, धन Partially वर ध तव द, ख डन, वर ध Partially य ग य ग Incorrect य न म त, य द, स ध, स ध, य न, ख़य ल, Incorrect खय ल आध र आध र वग ण, दज, वग, क ट

We tested the system in this way on documents from various domains. Table 5.2 summarises the results.

Figure 5.1: Histogram showing the WSD accuracy across domains for Hindi words (one bar per domain; axis: percentage of accuracy, 0 to 80)

Domain                   Percentage of Accuracy
Agriculture              71.28
Science and Sociology    50
Sociology                48.34
Short Story              52.97
Mass-Media               45.45
Children Literature      37.78
History                  42.85
Science                  62.5

Table 5.2: WSD accuracy across domains for Hindi words

6. Conclusion and Future Work

In this paper we have used the Hindi Wordnet for a fundamental NLP task, viz., disambiguation of Hindi words. To our knowledge, this is the first attempt at automatic WSD for an Indian language and is a significant step towards Indian language processing. As Table 5.2 shows, our accuracy values range from about 40% to about 70%. The performance can surely be improved if morphology is handled exhaustively.
The system currently does not detect the underlying similarity between words in the presence of morphological variations. Since Indian languages are rich in morphology, exhaustive morphological preprocessing is crucial to the whole WSD process. Our system currently deals only with nouns. Work is in progress to include words of other parts of speech. The obstacle there is the shallowness of the lexical network for non-noun words. With the enrichment of, for example, the verb hierarchy [18], the system performance is expected to improve considerably.
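The effect of morphological variation on the overlap count can be illustrated with a small sketch. Exact string matching misses inflected forms, while mapping each token to a base form first recovers the overlap. The lemma table below is a hypothetical stand-in for a real morphological analyser (such as the morphstr routine of the API); the words and bags are toy data, not taken from the evaluation.

```python
# Hypothetical lemma table: inflected noun form -> base form (illustration only).
LEMMAS = {
    "बच्चे": "बच्चा",
    "बच्चों": "बच्चा",
    "किताबें": "किताब",
    "किताबों": "किताब",
}


def normalize(tokens, lemmas):
    """Replace every token that has a known base form by that base form."""
    return {lemmas.get(token, token) for token in tokens}


def overlap(context_bag, semantic_bag, lemmas=None):
    """Number of shared tokens, optionally after mapping tokens to base forms."""
    if lemmas is not None:
        context_bag = normalize(context_bag, lemmas)
        semantic_bag = normalize(semantic_bag, lemmas)
    return len(context_bag & semantic_bag)


context_bag = {"बच्चों", "किताबें", "लाए"}      # inflected forms, as found in running text
semantic_bag = {"बच्चा", "किताब", "विद्यालय"}   # base forms, as listed in a lexicon

raw = overlap(context_bag, semantic_bag)
lemmatized = overlap(context_bag, semantic_bag, LEMMAS)
print(raw, lemmatized)
```

Without normalization the two bags share no token, so a sense supported only by inflected context words would receive no credit.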

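The range quoted in the conclusion can be checked directly against the figures of Table 5.2. The sketch below finds the lowest and highest domain and an unweighted mean over the eight domains; the mean is only an illustration, since the number of documents or words per domain is not given and a weighted figure cannot be derived.

```python
# Accuracy figures copied from Table 5.2 (percent).
accuracy = {
    "Agriculture": 71.28,
    "Science and Sociology": 50.0,
    "Sociology": 48.34,
    "Short Story": 52.97,
    "Mass-Media": 45.45,
    "Children Literature": 37.78,
    "History": 42.85,
    "Science": 62.5,
}

lowest = min(accuracy, key=accuracy.get)
highest = max(accuracy, key=accuracy.get)

# Unweighted mean over domains; every domain counts equally (an assumption).
macro_average = sum(accuracy.values()) / len(accuracy)

print(f"lowest:  {lowest} ({accuracy[lowest]}%)")
print(f"highest: {highest} ({accuracy[highest]}%)")
print(f"unweighted mean over domains: {macro_average:.2f}%")
```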
7. References

[1] Christiane Fellbaum, editor. WordNet: An Electronic Lexical Database. MIT Press, May 1998.
[2] Nancy Ide. Parallel Translations as Sense Discriminators. In Proceedings of SIGLEX99, Washington D.C., USA, 1999.
[3] David Yarowsky. Word Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora. In Proceedings of the 14th International Conference on Computational Linguistics (COLING-92), pages 454-460, Nantes, France, 1992.
[4] E. Agirre and G. Rigau. Word Sense Disambiguation Using Conceptual Density. In Proceedings of COLING 96.
[5] Adam Kilgarriff. Gold Standard Data-sets for Evaluating Word Sense Disambiguation Programs. Computer Speech and Language 12(4), Special Issue on Evaluation, 1998.
[6] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, Inc.
[7] G. Ramakrishnan, Deepa, Prithviraj B., P. Bhattacharyya, S. Chakrabarti. Soft Word Sense Disambiguation. GWC-2003.
[8] D. Chakrabarti, D. Narayan, P. Pandey, P. Bhattacharyya. An Experience in Building the Indo-WordNet: A WordNet for Hindi. GWC-2002.
[9] S. Jha, D. Narayan, P. Pande, P. Bhattacharyya. A Wordnet for Hindi. Workshop on Lexical Resources in Natural Language Processing, India, 2001.
[10] D. Narayan and P. Bhattacharyya. Using Verb Noun Association for Word Sense Disambiguation. International Conference on Natural Language Processing (ICON 2002), Mumbai, India, December 2002.
[11] C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing, Word Sense Disambiguation chapter. The MIT Press, Cambridge, Massachusetts; London, England.
[12] Dan Klein, Kristina Toutanova, H. Tolga Ilhan, Sepandar D. Kamvar, and Christopher D. Manning. Combining Heterogeneous Classifiers for Word-Sense Disambiguation. In Workshop on Word Sense Disambiguation: Recent Successes and Future Directions at ACL 40, pages 74-80, 2002.
[13] H. Tolga Ilhan, Sepandar D. Kamvar, Dan Klein, Christopher D. Manning, Kristina Toutanova. Combining Heterogeneous Classifiers for Word-Sense Disambiguation. In Proceedings of SENSEVAL-2, the Second International Workshop on Evaluating Word Sense Disambiguation Systems, pages 87-90, 2001.
[14] D. Yarowsky. Hierarchical Decision Lists for Word Sense Disambiguation. Computers and the Humanities, 34(2):179-186, 2000.
[15] P. Resnik and D. Yarowsky. Distinguishing Systems and Distinguishing Senses: New Evaluation Methods for Word Sense Disambiguation. Natural Language Engineering, 5(2), pages 113-133, 2000.
[16] D. Lin. An Information-Theoretic Definition of Similarity. In Proceedings of the International Conference on Machine Learning, Madison, Wisconsin, July 1998.
[17] M. E. Lesk. Automatic Sense Disambiguation Using Machine Readable Dictionaries: How to Tell a Pine Cone from an Ice Cream Cone. In Proceedings of the 1986 SIGDOC Conference, Toronto, Ontario, June 1986.
[18] D. Chakrabarti and P. Bhattacharyya. Creation of English and Hindi Verb Hierarchies and their Application to Hindi WordNet Building and English-Hindi MT. GWC-2004, Czech Republic.
[19] Hindi Wordnet from the Center for Indian Language Technology Solutions, IIT Bombay, Mumbai, India. http://www.cfilt.iitb.ac.in/wordnet/webhwn/
[20] Hindi Corpora from the Central Institute of Indian Languages, Mysore, India. http://www.ciil.org

Acknowledgement

The research is supported by a grant from the Ministry of Information Technology and Communications, Government of India, New Delhi, and the Development Gateway Foundation.

Appendix - I: Hindi Wordnet API used for WSD

char * morphstr (char *origstr, int pos)
Finds the base form of the word origstr (original string) in the specified pos. The first call (with origstr specified) returns a pointer to the first base form found. Subsequent calls requesting base forms of the same string must be made with NULL as the first argument. When no more base forms for origstr can be found, origstr itself is returned.

unsigned int in_wn (char *searchstr)
Finds the part of speech. Returns an unsigned integer with a bit set corresponding to each syntactic category containing searchstr. 0 is returned if searchstr is not present in pos.

IndexPtr index_lookup (char *searchstr, int pos)
Finds searchstr in the index file for pos. Returns a pointer to the parsed entry in an Index data structure. Returns NULL if a match is not found.

char * findtheinfo (char *searchstr, int pos, char *ptr_type, int sense_num)
Searches the database for relational information about a word. Returns a pointer to the text buffer. ptr_type gives the pointer to the relations in Wordnet and sense_num is the particular sense number.
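The value returned by in_wn is a bit mask, which a caller has to decode before use. The Python sketch below shows one way of doing so. The category numbers (noun 1, verb 2, adjective 3, adverb 4, with bit n set as 1 << n) follow the convention of the Princeton WordNet C library and are an assumption here; the appendix does not state the numbering used by the Hindi Wordnet API.

```python
# Assumed category numbers (Princeton WordNet convention); the Hindi Wordnet
# API may use different values.
POS_NUMBERS = {"noun": 1, "verb": 2, "adjective": 3, "adverb": 4}


def decode_pos_mask(mask):
    """List the categories whose bit (1 << number) is set in an in_wn-style mask."""
    return [name for name, number in POS_NUMBERS.items() if mask & (1 << number)]


# A word found both as a noun and as a verb under the assumed numbering.
mask = (1 << POS_NUMBERS["noun"]) | (1 << POS_NUMBERS["verb"])
print(decode_pos_mask(mask))

# A mask of 0 means the word was not found in any category.
print(decode_pos_mask(0))
```

Since the system currently disambiguates nouns only, a caller would typically test just the noun bit before building the semantic bag.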