Recognizing Names in Biomedical Texts using Hidden Markov Model and SVM plus Sigmoid


ZHOU GuoDong
Institute for Infocomm Research
21 Heng Mui Keng Terrace, Singapore 119613
Email: zhougd@i2r.a-star.edu.sg

ABSTRACT

In this paper, we present a named entity recognition system in the biomedical domain, called PowerBioNE. In order to deal with the special phenomena in the biomedical domain, various evidential features are proposed and integrated through a Hidden Markov Model (HMM). In addition, a Support Vector Machine (SVM) plus sigmoid is proposed to resolve the data sparseness problem in our system. Finally, we present two post-processing modules to deal with the cascaded entity name and abbreviation phenomena. Evaluation shows that our system achieves the F-measure of 69.1 and 71.2 on the 23 classes of GENIA V1.1 and V3.0 respectively. In particular, our system achieves the F-measure of 77.8 on the "protein" class of GENIA V3.0. It shows that our system outperforms the best published system on GENIA V1.1 and V3.0.

1. INTRODUCTION

With an overwhelming amount of textual information in molecular biology and biomedicine, there is a need for effective and efficient literature mining and knowledge discovery that can help biologists to gather and make use of the knowledge encoded in text documents. In order to make organized and structured information available, automatically recognizing biomedical entity names becomes critical and is important for protein-protein interaction extraction, pathway construction, automatic database curation, etc.

Such a task, called named entity recognition, has been well developed in the Information Extraction literature (MUC-6; MUC-7). In MUC, the task of named entity recognition is to recognize the names of persons, locations, organizations, etc. in the newswire domain. In the biomedical domain, we care about entities like gene, protein, virus, etc. In recent years, many explorations have been done to port existing named entity recognition systems into the biomedical domain (Kazama et al 2002; Lee et al 2003; Shen et al 2003; Zhou et al 2004).
However, few of them have achieved satisfactory performance due to the special characteristics in the biomedical domain, such as long and descriptive naming conventions, conjunctive and disjunctive structure, casual naming convention and rapidly emerging new biomedical names, abbreviation, and cascaded construction. On all accounts, we can say that the entity names in the biomedical domain are much more complex than those in the newswire domain.

In this paper, we present a named entity recognition system in the biomedical domain, called PowerBioNE. In order to deal with the special phenomena in the biomedical domain, various evidential features are proposed and integrated effectively and efficiently through a Hidden Markov Model (HMM). In addition, a Support Vector Machine (SVM) plus sigmoid is proposed to resolve the data sparseness problem in our system. Finally, we present two post-processing modules to deal with the cascaded entity name and abbreviation phenomena to further improve the performance.

All of our experiments are done on the GENIA corpus, which is the largest annotated corpus in the molecular biology domain available to the public (Ohta et al. 2002). In our experiments, two versions are used: 1) GENIA V1.1, which contains 670 MEDLINE abstracts of 123K words; 2) GENIA V3.0, which is a superset of GENIA V1.1 and contains 2000 MEDLINE abstracts of 360K words. The annotation of biomedical entities is based on the GENIA ontology (Ohta et al. 2002), which includes 23 distinct classes: multi-cell, mono-cell, virus, body part, tissue, cell type, cell component, organism, cell line, other artificial source, protein, peptide, amino acid monomer, DNA, RNA, poly nucleotide, nucleotide, lipid, carbohydrate, other organic compound, inorganic, atom and other.

2. FEATURES

In order to deal with the special phenomena in the biomedical domain, various evidential features are explored.

Word Formation Pattern (F_WFP): The purpose of this feature is to capture capitalization, digitalization and other word formation information. This feature has been widely used in the biomedical domain (Kazama et al 2002; Shen et al 2003; Zhou et al 2004). In this paper, the same feature as Shen et al 2003 is used.

Morphological Pattern (F_MP): Morphological information, such as prefix and suffix, is considered as an important cue for terminology identification and has been widely applied in the biomedical domain (Kazama et al 2002; Lee et al 2003; Shen et al 2003; Zhou et al 2004). Same as Shen et al 2003, we use a statistical method to get the most useful prefixes/suffixes from the training data.

Part-of-Speech (F_POS): Since many of the words in biomedical entity names are in lowercase, capitalization information in the biomedical domain is not as evidential as that in the newswire domain. Moreover, many biomedical entity names are descriptive and very long. Therefore, POS may provide useful evidence about the boundaries of biomedical entity names.

Head Noun Trigger (F_HEAD): The head noun, which is the major noun of a noun phrase, often describes the function or the property of the noun phrase. In this paper, we automatically extract unigram and bigram head nouns from the training data, and rank them by frequency. For each entity class, we select 50% of top ranked head nouns as head noun triggers. Table 1 shows some of the examples.

Table 1: Examples of auto-generated head nouns

Class      Unigram       Bigram
PROTEIN    interleukin   activator protein
           interferon    binding protein
           kinase        cell receptor
DNA        DNA           X chromosome
           cDNA          binding motif
           chromosome    promoter element

Name Alias Feature (F_ALIAS): Besides the above widely used features, we also propose a novel name alias feature. The intuition behind this feature is the name alias phenomenon that relevant entities will be referred to in many ways throughout a given text and thus success of named entity recognition is conditional on success at determining when one noun phrase refers to the very same entity as another noun phrase.

During decoding, the entity names already recognized from the previous sentences of the document are stored in a list. When the system encounters an entity name candidate (e.g. a word with a special word formation pattern), a name alias algorithm (similar to Schwartz et al 2003) is invoked to first dynamically determine whether the entity name candidate might be an alias for a previously recognized name in the recognized list. This is done by checking whether all the characters in the entity name candidate exist in a recognized entity name in the same order and whether the first character in the entity name candidate is the same as the first character in the recognized name. For a relevant work, please see Jacquemin (2001).

The name alias feature F_ALIAS is represented as ENTITYnLm (L indicates the locality of the name alias phenomenon). Here ENTITY indicates the class of the recognized entity name and n indicates the number of the words in the recognized entity name while m indicates the number of the words in the recognized entity name from which the name alias candidate is formed. For example, when the decoding process encounters the word "TCF", the word "TCF" is proposed as an entity name candidate and the name alias algorithm is invoked to check if the word "TCF" is an alias of a recognized named entity. If "T cell Factor" is a "Protein" name recognized earlier in the document, the word "TCF" is determined as an alias of "T cell Factor" with the name alias feature Protein3L3 by taking the three initial letters of the three-word "protein" name "T cell Factor".

3. METHODS

3.1 Hidden Markov Model

Given the above various features, the key problem is how to effectively and efficiently integrate them together and find the optimal resolution to biomedical named entity recognition. Here, we use the Hidden Markov Model (HMM) as described in Zhou et al 2002. An HMM is a model where a sequence of outputs is generated in addition to the Markov state sequence. It is a latent variable model in the sense that only the output sequence is observed while the state sequence remains "hidden". Given an observation sequence $O_1^n = o_1 o_2 \ldots o_n$, the purpose of an HMM is to find the most likely state sequence $S_1^n = s_1 s_2 \ldots s_n$ that maximizes $\log P(S_1^n \mid O_1^n)$.
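The name alias check described for the F_ALIAS feature in Section 2 can be illustrated with a small sketch. It is only one possible reading of the description: the function name `alias_feature`, the case-insensitive comparison, and the leftmost-first character matching used to count the contributing words (m) are assumptions made here, not details given in the paper.

```python
from typing import Optional


def alias_feature(candidate: str, recognized_name: str, entity_class: str) -> Optional[str]:
    """Return an ENTITYnLm-style feature string, or None if `candidate`
    is not an alias of `recognized_name`.

    Assumed rule (paraphrasing Section 2): every character of the candidate
    must occur in the recognized name in the same order, and the first
    characters must be identical. Matching is case-insensitive here, which
    is an assumption of this sketch.
    """
    words = recognized_name.split()
    # Flatten the name into (lowercased character, word index) pairs, skipping spaces.
    chars = [(ch.lower(), w) for w, word in enumerate(words) for ch in word]
    cand = candidate.lower()
    if not cand or not chars or cand[0] != chars[0][0]:
        return None

    used_words = set()  # words of the recognized name that contribute a character
    pos = 0
    for ch in cand:
        # Leftmost-first subsequence matching (an assumption of this sketch).
        while pos < len(chars) and chars[pos][0] != ch:
            pos += 1
        if pos == len(chars):
            return None
        used_words.add(chars[pos][1])
        pos += 1

    # n = words in the recognized name, m = words the alias is formed from.
    return f"{entity_class}{len(words)}L{len(used_words)}"


print(alias_feature("TCF", "T cell Factor", "Protein"))
print(alias_feature("XYZ", "T cell Factor", "Protein"))
```

With the example from the text, "TCF" against the recognized name "T cell Factor" yields the feature string Protein3L3, while a candidate that fails the ordered-character test yields no feature.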
Here, the observation $o_i = \langle f_i, w_i \rangle$, where $w_i$ is the word and $f_i = \langle F_i^{WFP}, F_i^{MP}, F_i^{POS}, F_i^{HEAD}, F_i^{ALIAS} \rangle$ is the feature set of the word $w_i$, and the state $s_i$ is structural and $s_i = BOUNDARY_i\_ENTITY_i\_FEATURE_i$, where $BOUNDARY_i$ denotes the position of the current word in the entity; $ENTITY_i$ indicates the class of the entity; and $FEATURE_i$ is the feature set used to model the ngram more precisely. By rewriting $\log P(S_1^n \mid O_1^n)$, we have:

$$\log P(S_1^n \mid O_1^n) = \log P(S_1^n) + \log \frac{P(S_1^n, O_1^n)}{P(S_1^n)\,P(O_1^n)} \qquad (1)$$

The second term in Equation (1) is the mutual information between $S_1^n$ and $O_1^n$. In order to simplify the computation of this term, we assume mutual information independence:

$$MI(S_1^n, O_1^n) = \sum_{i=1}^{n} MI(s_i, O_1^n) \quad \text{or} \quad \log \frac{P(S_1^n, O_1^n)}{P(S_1^n)\,P(O_1^n)} = \sum_{i=1}^{n} \log \frac{P(s_i, O_1^n)}{P(s_i)\,P(O_1^n)} \qquad (2)$$

That is, an individual tag is only dependent on the output sequence $O_1^n$ and independent of other tags in the tag sequence $S_1^n$. This assumption is reasonable because the dependence among the tags in the tag sequence $S_1^n$ has already been captured by the first term in Equation (1). Applying the assumption (2) to Equation (1), we have:

$$\log P(S_1^n \mid O_1^n) = \log P(S_1^n) - \sum_{i=1}^{n} \log P(s_i) + \sum_{i=1}^{n} \log P(s_i \mid O_1^n) \qquad (3)$$

From Equation (3), we can see that: The first term can be computed by applying chain rules. In ngram modeling (Chen et al 1996), each tag is assumed to be dependent on the N-1 previous tags. The second term is the summation of log probabilities of all the individual tags. The third term corresponds to the "lexical" component (dictionary) of the tagger.

The idea behind the model is that it tries to assign each output an appropriate tag (state), which contains boundary and class information. For example, "TCF binds stronger than NF kB to TCEd DNA". The tag assigned to token "TCF" should indicate that it is at the beginning of an entity name and it belongs to the "Protein" class; and the tag assigned to token "binds" should indicate that it does not belong to an entity name. Here, the Viterbi algorithm (Viterbi 1967) is implemented to find the most likely tag sequence.

The problem with the above HMM lies in the data sparseness problem raised by $P(s_i \mid O_1^n)$ in the third term of Equation (3). Ideally, we would have sufficient training data for every event whose conditional probability we wish to calculate. Unfortunately, there is rarely enough training data to compute accurate probabilities when decoding on new data. Generally, two smoothing approaches (Chen et al 1996) are applied to resolve this problem: linear interpolation and back-off. However, these two approaches only work well when the number of different information sources is limited. When a few features and/or a long context are considered, the number of different information sources is exponential. In this paper, a Support Vector Machine (SVM) plus sigmoid is proposed to resolve this problem in our system.
3.2 Support Vector Machine plus Sigmoid

Support Vector Machines (SVMs) are a popular machine learning approach first presented by Vapnik (1995). Based on the structural risk minimization of statistical learning theory, SVMs seek an optimal separating hyper-plane to divide the training examples into two classes and make decisions based on support vectors which are selected as the only effective examples in the training set. However, SVMs produce an un-calibrated value that is not a probability. That is, the unthresholded output of an SVM can be represented as

$$f(x) = \sum_{i \in SV} a_i y_i k(x_i, x) + b \qquad (4)$$

To map the SVM output into the probability, we train an additional sigmoid model (Platt 1999):

$$p(s \mid f) = \frac{1}{1 + \exp(Af + B)} \qquad (5)$$

Basically, SVMs are binary classifiers. Therefore, we must extend SVMs to multi-class (e.g. K) classifiers. For efficiency, we apply the one vs. others strategy, which builds K classifiers so as to separate one class from all others, instead of the pairwise strategy, which builds K*(K-1)/2 classifiers considering all pairs of classes. Moreover, we only apply the simple linear kernel, although other kernels (e.g. polynomial kernel) and the pairwise strategy can have better performance. Finally, for each state $s$, there is one sigmoid $p(s \mid f_s)$. Therefore, the sigmoid outputs are normalized to get a probability distribution using

$$P(s \mid O_1^n) = \frac{p(s \mid f_s)}{\sum_{s'} p(s' \mid f_{s'})}$$

3.3 Post-Processing

Two post-processing modules, namely cascaded entity name resolution and abbreviation resolution, are applied in our system to further improve the performance.

Cascaded Entity Name Resolution

It is found (Shen et al 2003) that 16.57% of entity names in GENIA V3.0 have cascaded constructions, e.g. <RNA><DNA>CIITA</DNA> mRNA</RNA>. Therefore, it is important to resolve such phenomenon. Here, a pattern-based module is proposed to resolve the cascaded entity names while the above HMM is applied to recognize embedded entity names and non-cascaded entity names. In the GENIA corpus, we find that there are six useful patterns of cascaded entity name constructions:

<ENTITY> := <ENTITY> + head noun, e.g. <PROTEIN> binding motif → <DNA>
<ENTITY> := <ENTITY> + <ENTITY>, e.g. <LIPID> <PROTEIN> → <PROTEIN>
<ENTITY> := modifier + <ENTITY>, e.g. anti <Protein> → <Protein>
<ENTITY> := <ENTITY> + word + <ENTITY>, e.g. <VIRUS> infected <MULTICELL> → <MULTICELL>
<ENTITY> := modifier + <ENTITY> + head noun
<ENTITY> := <ENTITY> + <ENTITY> + head noun

In our experiments, all the rules of the above six patterns are extracted from the cascaded entity names in the training data to deal with the cascaded entity name phenomenon.

Abbreviation Resolution

While the name alias feature is useful to detect the inter-sentential name alias phenomenon, it is unable to identify the inner-sentential name alias phenomenon: the inner-sentential abbreviation. Such abbreviations widely occur in the biomedical domain. In our system, we present an effective and efficient algorithm to recognize the inner-sentential abbreviations more accurately by mapping them to their full expanded forms. In the GENIA corpus, we observe that the expanded form and its abbreviation often occur together via parentheses. Generally, there are two patterns: "expanded form (abbreviation)" and "abbreviation (expanded form)".

Our algorithm is based on the fact that it is much harder to classify an abbreviation than its expanded form. Generally, the expanded form is more evidential than its abbreviation to determine its class. The algorithm works as follows: Given a sentence with parentheses, we use a similar algorithm as Schwartz et al 2003 to determine whether it is an abbreviation with parentheses. This is done by starting from the end of both the abbreviation and the expanded form, moving from right to left and trying to find the shortest expanded form that matches the abbreviation. Any character in the expanded form can match a character in the abbreviation with one exception: the character at the beginning of the abbreviation must match the first alphabetic character of the first word in the expanded form. If yes, we remove the abbreviation and the parentheses from the sentence.
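The right-to-left matching step just described can be sketched as follows. This is a small Python rendering in the spirit of the Schwartz and Hearst matching procedure, not the system's own code: the names `find_expanded_form` and `strip_abbreviation` are hypothetical, only the "expanded form (abbreviation)" pattern is handled, and comparison is case-insensitive by assumption.

```python
import re
from typing import Optional, Tuple


def find_expanded_form(abbrev: str, preceding_text: str) -> Optional[str]:
    """Scan both strings from right to left and return the shortest trailing
    span of `preceding_text` whose characters cover `abbrev` in order.
    The first character of the abbreviation must start a word."""
    s_idx = len(abbrev) - 1
    l_idx = len(preceding_text) - 1
    while s_idx >= 0:
        ch = abbrev[s_idx].lower()
        if not ch.isalnum():
            s_idx -= 1  # ignore punctuation inside the abbreviation
            continue
        # Move left until the character matches; for the first abbreviation
        # character, additionally require a word-initial position.
        while ((l_idx >= 0 and preceding_text[l_idx].lower() != ch) or
               (s_idx == 0 and l_idx > 0 and preceding_text[l_idx - 1].isalnum())):
            l_idx -= 1
        if l_idx < 0:
            return None
        l_idx -= 1
        s_idx -= 1
    # Extend back to the start of the word containing the last match.
    start = preceding_text.rfind(" ", 0, l_idx + 1) + 1
    return preceding_text[start:]


PAREN = re.compile(r"\s*\(([^()]+)\)")


def strip_abbreviation(sentence: str) -> Tuple[str, Optional[str], Optional[str]]:
    """If the first parenthesized span is an abbreviation of the text before it,
    return (sentence without the parenthetical, abbreviation, expanded form);
    otherwise return the sentence unchanged with two None values."""
    m = PAREN.search(sentence)
    if not m:
        return sentence, None, None
    abbrev = m.group(1).strip()
    expanded = find_expanded_form(abbrev, sentence[:m.start()])
    if expanded is None:
        return sentence, None, None
    return sentence[:m.start()] + sentence[m.end():], abbrev, expanded


print(strip_abbreviation("We study T cell factor (TCF) binding."))
```

The stripped sentence is what a tagger would see; the returned abbreviation and expanded form are what a caller would need in order to put the parenthetical back afterwards and to copy the class of the expanded form onto the abbreviation.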
After the sentence is processed, we restore the abbreviation with parentheses to its original position in the sentence. Then, the abbreviation is classified as the same class of the expanded form, if the expanded form is recognized as an entity name. In the meanwhile, we also adjust the boundaries of the expanded form according to the abbreviation, if necessary. Finally, the expanded form and its abbreviation are stored in the recognized list of biomedical entity names from the document to help the resolution of forthcoming occurrences of the same abbreviation in the document.

4. EXPERIMENTS AND EVALUATION

We evaluate our PowerBioNE system on GENIA V1.1 and GENIA V3.0 using precision/recall/F-measure. For each evaluation, we select 20% of the corpus as the held-out test data and the remaining 80% as the training data. All the experimentations are done 5 times and the evaluations are averaged over the held-out test data. For cascaded entity name resolution, an average of 59 and 97 rules are extracted from the cascaded entity names in the training data of GENIA V1.1 and V3.0 respectively. For POS, all the POS taggers are trained on the training data with POS imported from the corresponding GENIA V3.02p with POS annotated.

Table 2 shows the performance of our system on GENIA V1.1 and GENIA V3.0, and the comparison with that of the best reported system (Shen et al 2003). It shows that our system achieves the F-measure of 69.1 on GENIA V1.1 and the F-measure of 71.2 on GENIA V3.0 respectively, without help of any dictionaries. It also shows that our system outperforms Shen et al (2003) by 6.9 F-measure on GENIA V1.1 and 4.6 F-measure on GENIA V3.0. This is largely due to the superiority of the SVM plus sigmoid in our system (improvement of 3.7 F-measure on GENIA V3.0) over the back-off approach in Shen et al (2003) and the novel name alias feature (improvement of 1.2 F-measure on GENIA V3.0). Finally, evaluation also shows that the cascaded entity name resolution and the abbreviation resolution contribute 3.4 and 2.1 respectively in F-measure on GENIA V3.0.

Table 2: Performance of our PowerBioNE system

Performance                    P      R      F
Shen et al on GENIA V3.0     66.5   66.6   66.6
Shen et al on GENIA V1.1     63.1   61.2   62.2
Our system on GENIA V3.0     72.7   69.8   71.2
Our system on GENIA V1.1     70.4   67.9   69.1
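As a quick check on how the F column of Table 2 relates to precision and recall, here is a minimal sketch assuming the balanced F-measure (harmonic mean of P and R, beta = 1); the paper does not state the weighting explicitly, so `beta` is an assumption. Because the reported figures are averaged over several runs, a recomputed value may differ from a table entry in the last digit.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision <= 0 or recall <= 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall of "our system" as given in Table 2 (in percent).
print(round(f_measure(72.7, 69.8), 1))  # GENIA V3.0
print(round(f_measure(70.4, 67.9), 1))  # GENIA V1.1
```

Under this assumption the two rows for the proposed system reproduce the reported F-measures of 71.2 and 69.1.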

Table 3: Performance of different entity classes on GENIA V3.0

Entity Class   Number of instances in the training data     F
Cell Type      6034                                         81.8
Lipid          602                                          68.6
Multi-Cell     463                                          78.1
Protein        2380                                         77.8
DNA            7538                                         70.8
Cell Line      326                                          68.5
RNA            695                                          56.2
Virus          873                                          67.2

One important question is about the performance of different entity classes. Table 3 shows the performance of some of the biomedical entity classes on GENIA V3.0. Of particular interest, our system achieves the F-measure of 77.8 on the class "Protein". It shows that the performance varies a lot among different entity classes. One reason may be due to different difficulties in recognizing different entity classes. Another reason may be due to the different numbers of instances in different entity classes. Though GENIA V3.0 provides a good basis for named entity recognition in the biomedical domain and probably the best available, it has clear bias. Table 3 shows that, while GENIA V3.0 is of enough size for recognizing the major classes, such as "Protein", "Cell Type", "Cell Line", "Lipid" etc, it is of limited size in recognizing other classes, such as "Virus".

5. ERROR ANALYSIS

In order to further evaluate our system and explore possible improvement, we have implemented an error analysis. This is done by randomly choosing 100 errors from our recognition results. During the error analysis, we find many errors are due to the strict annotation scheme and the annotation inconsistency in the GENIA corpus, and can be considered acceptable. Therefore, we will also examine the acceptable F-measure of our system, in particular, the acceptable F-measure on the "protein" class. All the 100 errors are classified as follows:

Left boundary errors (14): It includes the errors with correct class identification, correct right boundary detection and only wrong left boundary detection. We find that most of such errors come from the long and descriptive naming convention. We also find that 11 of 14 errors are acceptable and ignorance of the descriptive words often does not make much difference for the entity names. In fact, it is even hard for biologists to decide whether the descriptive words should be a part of the entity names, such as "normal", "activated", etc.
In particular, 4 of 14 errors belong to the "protein" class. Among them, two errors are acceptable, e.g. "classical <PROTEIN>1,25 (OH)2D3 receptor</PROTEIN>" => "<PROTEIN>classical 1,25 (OH)2D3 receptor</PROTEIN>" (with format of "annotation in the corpus => identification made by our system"), while the other two are unacceptable, e.g. "<PROTEIN>viral transcription factor</PROTEIN>" => "viral <PROTEIN>transcription factor</PROTEIN>".

Cascaded entity name errors (16): It includes the errors caused by the cascaded entity name phenomenon. We find that most of such errors come from the annotation inconsistency in the GENIA corpus: In some cases, only the embedded entity names are annotated while in other cases, the embedded entity names are not annotated. Our system tends to annotate both the embedded entity names and the whole entity names. Among them, we find that 13 of 16 errors are acceptable. In particular, 2 of 16 errors belong to the "protein" class and both are acceptable, e.g. "<DNA>NF kappa B binding site</DNA>" => "<DNA><PROTEIN>NF kappa B</PROTEIN> binding site</DNA>".

Misclassification errors (18): It includes the errors with wrong class identification, correct right boundary detection and correct left boundary detection. We find that this kind of errors mainly comes from the sense ambiguity of biomedical entity names and is very difficult to disambiguate. Among them, 8 errors are related with the "DNA" class and 6 errors are related with the "Cell Line" and "Cell Type" classes. We also find that only 3 of 18 errors are acceptable. In particular, there are 6 errors related to the "protein" class. Finally, we find that all the 6 errors are caused by misclassification of the "DNA" class to the "protein" class and all of them are unacceptable, e.g. "<DNA>type I IFN</DNA>" => "<PROTEIN>type I IFN</PROTEIN>".

True negative (23): It includes the errors by missing the identification of biomedical entity names. We find that 6 errors come from the "other" class and 10 errors from the "protein" class. We also find that the GENIA corpus annotates some general noun phrases as biomedical entity names, e.g. "protein" in "the protein" and "cofactor" in "a cofactor". Finally, we find that 11 of 23 errors are acceptable.
In particular, 9 of 23 errors are related to the "protein" class. Among them, 3 errors are acceptable, e.g. "the <PROTEIN>protein</PROTEIN>" => "the protein", while the other 6 are unacceptable, e.g. "<PROTEIN>80 kDa</PROTEIN>" => "80 kDa".

False positive (15): It includes the errors by wrongly identifying biomedical entity names which are not annotated in the GENIA corpus. We find that 9 of 15 errors come from the "other" class. This suggests that the annotation of the "other" class lacks consistency and is the most problematic in the GENIA corpus. We also find that 7 of 15 errors are acceptable. In particular, 2 of 15 errors are related to the "protein" class and both are acceptable, e.g. "affinity sites" => "<PROTEIN>affinity sites</PROTEIN>".

Miscellaneous (14): It includes all the other errors, e.g. combination of the above errors and the errors caused by parentheses. We find that only 1 of 14 errors is acceptable. We also find that, among them, 2 errors are related with the "protein" class and both are unacceptable, e.g. "<PROTEIN>7 amino acid epitope</PROTEIN>" => "7 <RNA>amino acid epitope</RNA>".

From the above error analysis, we find that about half (46/100) of errors are acceptable and can be avoided by a flexible annotation scheme (e.g. regarding the modifiers in the left boundaries) and consistent annotation (e.g. in the annotation of the "other" class and the cascaded entity name phenomenon). In particular, about one third (9/25) of errors are acceptable on the "protein" class. This means that the acceptable F-measure can reach about 84.4 on the 23 classes of GENIA V3.0. In particular, the acceptable F-measure on the "protein" class is about 85.8. In addition, this performance is achieved without using any extra resources (e.g. dictionaries). With help of extra resources, we think an acceptable F-measure of near 90 can be achieved in the near future.

6. RELATED WORK

Previous approaches in biomedical named entity recognition typically use some domain specific heuristic rules and heavily rely on existing dictionaries (Fukuda et al 1998, Proux et al 1998 and Gaizauskas et al 2000). The current trend is to apply machine learning approaches in biomedical named entity recognition, largely due to the development of the GENIA corpus. The typical explorations include Kazama et al 2002, Lee et al 2003, Tsuruoka et al 2003, Shen et al 2003.
Kazama et al 2002 applies SVM and incorporates a rich feature set, including word feature, POS, prefix feature, suffix feature, previous class feature, word cache feature and HMM state feature. The experiment on GENIA V1.1 shows the F-measure of 54.4. Tsuruoka et al 2003 applies a dictionary-based approach and a naïve Bayes classifier to filter out false positives. It only evaluates against the "protein" class in GENIA V3.0, and receives the F-measure of 70.2 with help of a large dictionary. Lee et al 2003 uses a two phase SVM-based recognition approach and incorporates word formation pattern and part-of-speech. The evaluation on GENIA V3.0 shows the F-measure of 66.5 with help of an entity name dictionary. Shen et al 2003 proposes a HMM-based approach and two post-processing modules (cascaded entity name resolution and abbreviation resolution). Evaluation shows the F-measure of 62.2 and 66.6 on GENIA V1.1 and V3.0 respectively.

7. CONCLUSION

In the paper, we describe our HMM-based named entity recognition system in the biomedical domain, named PowerBioNE. Various lexical, morphological, syntactic, semantic and discourse features are incorporated to cope with the special phenomena in biomedical named entity recognition. In addition, a SVM plus sigmoid is proposed to effectively resolve the data sparseness problem. Finally, we present two post-processing modules to deal with cascaded entity name and abbreviation phenomena. The main contributions of our work are the novel name alias feature in the biomedical domain, the SVM plus sigmoid approach in the effective resolution of the data sparseness problem in our system and its integration with the Hidden Markov Model. In the near future, we will further improve the performance by investigating more on conjunction and disjunction construction, the synonym phenomenon, and exploration of extra resources (e.g. dictionary).

REFERENCES

Chen and Goodman. 1996. An Empirical Study of Smoothing Techniques for Language Modeling. In Proceedings of the 34th Annual Meeting of the Association of Computational Linguistics (ACL 1996). pp.310-318. Santa Cruz, California, USA.

Fukuda K., Tsunoda T., Tamura A., and Takagi T. 1998. Toward information extraction: identifying protein names from biological papers. In Proc. of the Pacific Symposium on Biocomputing 98 (PSB 98), 707-718.

Gaizauskas R., Demetriou G. and Humphreys K. 2000. Term Recognition and Classification in Biological Science Journal Articles. In Proc. of the Computational Terminology for Medical and Biological Applications Workshop of the 2nd International Conference on NLP, 37-44.

Jacquemin C. 2001. Spotting and Discovering Terms through Natural Language Processing. Cambridge: MIT Press.

Kazama J., Makino T., Ohta Y., and Tsujii J. 2002. Tuning Support Vector Machines for Biomedical Named Entity Recognition. In Proc. of the Workshop on Natural Language Processing in the Biomedical Domain (at ACL 2002), 1-8.

Lee K.J., Hwang Y.S. and Rim H.C. 2003. Two-phase biomedical NE Recognition based on SVMs. In Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine. pp.33-40. Sapporo, Japan.

MUC6. 1995. Morgan Kaufmann Publishers, Inc. In Proceedings of the Sixth Message Understanding Conference (MUC-6). Columbia, Maryland.

MUC7. 1998. Morgan Kaufmann Publishers, Inc. In Proceedings of the Seventh Message Understanding Conference (MUC-7). Fairfax, Virginia.

Ohta T., Tateisi Y., Kim J., Mima H., and Tsujii J. 2002. The GENIA corpus: An annotated research abstract corpus in molecular biology domain. In Proc. of HLT 2002.

Platt J. 1999. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. MIT Press.

Proux D., Rechenmann F., Julliard L., Pillet V. and Jacq B. 1998. Detecting Gene Symbols and Names in Biological Texts: A First Step toward Pertinent Information Extraction. In Proc. of Genome Inform Ser Workshop Genome Inform, 72-80.

Schwartz A.S. and Hearst M.A. 2003. A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text. In Proc. of the Pacific Symposium on Biocomputing (PSB 2003). Kauai.

Shen Dan, Zhang Jie, Zhou GuoDong, Su Jian and Tan Chew Lim. 2003. Effective Adaptation of a Hidden Markov Model-based Named Entity Recognizer for Biomedical Domain. In Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine. pp.49-56. Sapporo, Japan.

Tsuruoka Y. and Tsujii J. 2003. Boosting precision and recall of dictionary-based protein name recognition. In Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine. pp.41-48. Sapporo, Japan.

Vapnik V. 1995. The Nature of Statistical Learning Theory. NY, USA: Springer-Verlag.

Viterbi A.J. 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory, 260-269.

Zhou G.D. and Su J. 2002. Named Entity Recognition using a HMM-based Chunk Tagger. In Proc. of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), 473-480.