System of Automatic Chinese Webpage Summarization Based on The Random Walk Algorithm of Dynamic Programming


The Open Cybernetics & Systemics Journal, 2015, 9, 315-322. Open Access. 1874-110X/15 2015 Bentham Open

Feng Wang*,1, XiaoMing Qin2, Yizhen Wang3 and Xinjiang Wei

1 School of Mathematics and Statistics Science, Ludong University, Yantai 264025, China; Key Laboratory of Language Resource Development and Application of Shandong Province, Yantai 264025, China
2 College of Computer and Information Engineering, Jiaozuo Teachers College, Jiaozuo 454000, China; Institute of Computer Application Technology, Jiaozuo Teachers College, Jiaozuo 454000, China
3 Tilburg University, Tilburg, The Netherlands

Abstract: As the Internet becomes ever more closely tied to daily life, it has accumulated a massive and still rapidly growing body of text. The traditional way to help users find the content they need quickly and accurately is a search engine. However, the summaries produced by the automatic webpage summarization components of existing search engines are of low quality: they rely on purely statistical methods and simply gather sentences near the search phrases in the web document. The result neither represents the subject of the document nor takes the user's search phrases properly into account. To address these shortcomings, an automatic webpage summarization system is implemented. Building on earlier work, this paper proposes an automatic text summarization method based on relation graphs and text structure analysis. The method first segments the text into semantic paragraphs. For each semantic paragraph, subject terms are discovered by relation graph analysis. Finally, taking both the search phrase and the document subject into account, the summary is extracted under the guidance of the subject terms.

Keywords: Automatic summarization, random walk, semantic relatedness, seme based graph, webpage.
1. INTRODUCTION

Calculating lexical semantic relevance [2] is a basic problem in natural language processing (NLP) [1], and it underlies much research on semantic comprehension, classification and disambiguation. However, research on the semantic relevance of Chinese is limited, and statistical methods based on linguistic data need a large corpus and long training and suffer from the sparsity and imbalance of that corpus, so their results are often unsatisfactory. This paper therefore builds a sememe graph from CNKI and, on top of it, presents a model for evaluating word relevance. For this model, a random walk algorithm based on dynamic programming is proposed to calculate the direct and indirect relevance between sememes. According to the experiments, the proposed method has the following advantages: 1) the model is more consistent with human cognition; 2) its results correlate highly with human ratings; 3) it is easy to implement. The authors have applied the new lexical relevance algorithm in an automatic summarization system for Chinese web pages. Compared with the same system using the method of Q. Liu [3], the accuracy of summarization is greatly improved.

2. TECHNOLOGY FOUNDATION

2.1. Sememe and CNKI

The sememe, a concept from Chinese linguistics, is the minimum semantic unit of a word sense. Describing word senses with sememes turns complexity into simplicity and is convenient for formalization. In linguistics, each word, the minimum unit of language for independent use, may contain several sememes (for example, "transmission" contains "move", "change", etc.), and the sememe is in turn the semantic unit of a word sense. Sememe analysis is the most important and basic method of semantic research, and the formalized description of word senses that it provides is convenient for computer processing.
CNKI (the knowledge base known in English as HowNet), which uses sememe analysis to describe word senses, is the most detailed and complete Chinese knowledge system. It calls the sememe "yiyuan"; the yiyuan is the basic unit with which CNKI describes a word sense, and each word sense is described by several yiyuan. CNKI uses more than 1600 yiyuan, which are classified into 10 types. Each yiyuan type forms a tree structure through the hyponymy relations among its yiyuan. This paper uses CNKI as the word sememe dictionary to build a sememe probability network and calculate word relevance.

2.2. Random Walk Model Based on the Sememe Graph

The random walk is a classic application of the Markov chain. All states of a random walk and the transition relationships among them can be regarded as a directed graph in which each state is a node and each transition relationship is a weighted edge whose weight is the transition probability. In each step of the random walk, the probability of moving from the current state to the next state is the same regardless of the previous walk path and transitions [4].

This paper presents an extended random walk model that uses particles to simulate the human mind. When people evaluate the correlation between words, they commonly observe the two words at the same time and then consider the semantic differences between them. Since sememe analysis describes word semantics with sememes, lexical correlation can be obtained by analyzing how tightly the sememes are semantically related. Furthermore, we believe that the degree of overlap between two word senses is reflected by the average number of direct or indirect semantic relations that can be found between the two words within a certain time, and that this overlap can then be used to evaluate the semantic correlation between the words [5]. On this basis, we present an extended random walk model in which two particles, starting from the two given words, walk randomly at the same time along the edges of the sememe graph built from CNKI. Each encounter of the particles forms a path connecting the two words, which stands for one kind of semantic relation, so the semantic correlation between the words can be evaluated by the encounter probability of the particles.

First, we define the probability that a walking particle which starts from node $n_i$ reaches node $n_j$ at step $t$ by passing along an edge from any node $n_x$:
$$n^{(t)}_{n_i n_j} = \sum_{n_x \in V} n^{(t-1)}_{n_i n_x} \, P(n_j \mid n_x) \qquad (2\text{-}1)$$

Secondly, we calculate the encounter probability of two particles that start walking separately from $n_a$ and $n_b$ ($n_a$ and $n_b$ included), where one particle has walked $m$ steps and the other $t-m$ steps when they meet at a node $n_y$:

$$P^{(t)}(n_a, n_b) = \sum_{n_y \in V} \sum_{m=1}^{t} n^{(m)}_{n_a n_y} \, n^{(t-m)}_{n_b n_y} \qquad (2\text{-}2)$$

In addition, to simulate the human process of evaluating correlation, we introduce a parameter $t$, the random walk step limit, which bounds the length of a semantic path between two words: nodes that are far from the starting words carry no semantic correlation and lie beyond the limit of human analytical ability. With this in mind, we introduce $P_{before}(n_a, n_b, t)$, the encounter probability of two particles that walk separately from $n_a$ and $n_b$ with fewer than $t$ steps.

By introducing the encounter probability of two particles, this paper avoids the vector difference measures of unclear meaning that are needed when only the stationary distribution of a single particle is calculated, and it also yields a semantic relevance model based on simulating human intelligence.

2.3. Sememe Graph Formation

This paper takes the senses and sememes of CNKI as nodes and the explanation of a sense by sememes as the semantic connection between nodes, which forms a weighted sememe graph.

Sememe node: each sememe of CNKI corresponds to one node in the sememe graph. For example, one node in the graph corresponds to the CNKI sememe meaning "crime".

Sense node: a node that stands for a lexical sense. For example, (zuifan)#1 stands for the first sense of the word zuifan ("criminal"), which means a person who commits a crime.

Edge from sememe to sense: stands for the explanation of a sense by a sememe. For example, three sememes (crime, human, and bad people (or weeds)) are used to explain the sense (zuifan)#1.

Fig. (1). Sememe graph. (Edge weights shown in the figure: 1/3, 1/3, 1/3 and 1/4, 1/4, 1/4, 1/4.)

Fig. (1) shows the nodes and edges closely related to the two words "crime" and "offense" in the complete sememe graph.
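The construction described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary `toy_dictionary`, the function name `build_sememe_graph` and the sense and sememe labels are hypothetical stand-ins for CNKI data, and edges are stored in both directions so that a walk can alternate between sense nodes and sememe nodes.

```python
from collections import defaultdict


def build_sememe_graph(sense_to_sememes):
    """Build a weighted graph {node: {neighbor: weight}}.

    sense_to_sememes maps a sense label to the sememes that explain it.
    Each node spreads weight 1/degree evenly over its neighbors.
    Assumption: edges are stored in both directions so that a walk can
    alternate between sense nodes and sememe nodes.
    """
    neighbors = defaultdict(set)
    for sense, sememes in sense_to_sememes.items():
        for sememe in sememes:
            neighbors[sense].add(sememe)
            neighbors[sememe].add(sense)
    graph = {}
    for node, adjacent in neighbors.items():
        weight = 1.0 / len(adjacent)
        graph[node] = {other: weight for other in sorted(adjacent)}
    return graph


# Hypothetical toy dictionary; the labels are illustrative, not CNKI entries.
toy_dictionary = {
    "criminal#1": ["crime", "human", "bad"],
    "offense#1": ["crime", "act", "bad", "law"],
}
toy_graph = build_sememe_graph(toy_dictionary)
```

In this toy graph the sense with three sememes gives each edge weight 1/3 and the sense with four sememes gives each edge weight 1/4, which mirrors the weights visible in Fig. (1).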
In total, the sememe graph built in this paper contains 1618 sememe nodes, 668 sense nodes and 397086 edges. Because edges can exist only between sememe nodes and sense nodes, the resulting sememe graph is very sparse.

3. IMPROVED RANDOM WALK ALGORITHM

The edges of the sememe graph can be represented by a transition matrix $E_{N \times N}$, where $N$ is the total number of nodes in the sememe graph. Element $E[i, j]$ is the conditional probability of a transition from node $n_i$ to node $n_j$. Because there is no information available to distinguish the importance of different sememes, all edges are treated equally in this paper. More specifically, the probabilities of transition from a specific node to its different adjacent nodes are the same. For each edge $(n_i, n_j)$, element $E[i, j]$ is calculated as follows:

$$E[i, j] = \frac{1}{\mathrm{outdeg}(n_i)} \qquad (3\text{-}1)$$

where $\mathrm{outdeg}(n_i)$ is the out-degree of $n_i$.

A basic calculation follows directly from formula 2-1: calculate $n^{(t)}_{n_i n_j}$ for all needed $n_i$, $n_j$ and $t$, then calculate the encounter probability using formula 2-2, and finally obtain the result using formula 2-3. However, the time complexity of this calculation is as high as $O(n^3 t^2)$, where $n$ is the total number of nodes and $t$ is the time parameter in formula 2-3. Note that the matrix $E$ is sparse and that any path connecting two sense nodes passes through sememe nodes, of which there are few, so handling the sememe nodes first reduces the scale of the problem. This paper therefore introduces the sememe encounter probability matrix $S[mode, n_i, n_j, d]$ ($n_i, n_j \in V_{seme}$), which gives the encounter probability of two particles that start from nodes $n_i$ and $n_j$ after $d$ steps. Here mode is the encounter mode of the two particles, of which there are three types:

1) the particles start walking at the same time and meet at an intermediate sense node;
2) the first particle moves while the second stays where it is, and they meet at the second particle's starting position;
3) the reverse of 2): the second particle moves while the first stays where it is, and they meet at the first particle's starting position.

Note that the shortest path between two sememe nodes consists of two edges: one from a sememe node to a sense node, and the other from that sense node to a sememe node. This case is called step=1, as shown in Fig. (2) (left).

Fig. (2). 3 encounter modes and 5 transition situations with step=1 (left).
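As a minimal sketch of the basic calculation, the following Python code builds the uniform transition matrix of formula 3-1 and iterates the single-particle recurrence of formula 2-1. The adjacency-list input format and the function names `transition_matrix` and `reach_distribution` are assumptions made for illustration; a dense matrix is used only to keep the example short, whereas the graph in the paper is sparse.

```python
def transition_matrix(adjacency):
    """Return E with E[i][j] = 1 / outdeg(n_i) for every edge (n_i, n_j).

    adjacency: list where adjacency[i] is the list of neighbor indices of
    node i (assumed to contain no duplicates).
    """
    n = len(adjacency)
    E = [[0.0] * n for _ in range(n)]
    for i, neighbors in enumerate(adjacency):
        for j in neighbors:
            E[i][j] = 1.0 / len(neighbors)
    return E


def reach_distribution(E, start, steps):
    """Probability of a particle being at each node after `steps` moves.

    Implements the recurrence dist_t[j] = sum_x dist_{t-1}[x] * E[x][j],
    starting from a particle that sits on `start` with probability 1.
    """
    n = len(E)
    dist = [0.0] * n
    dist[start] = 1.0
    for _ in range(steps):
        nxt = [0.0] * n
        for x in range(n):
            if dist[x] == 0.0:
                continue
            for j in range(n):
                nxt[j] += dist[x] * E[x][j]
        dist = nxt
    return dist


# Toy path graph 0 - 1 - 2 (node 1 could be a sense, 0 and 2 its sememes).
E = transition_matrix([[1], [0, 2], [1]])
after_two_steps = reach_distribution(E, start=0, steps=2)
```

Each step of `reach_distribution` costs one matrix-vector product, and repeating it for every start node and every step count is what makes the basic calculation expensive.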
The encounter probability with step=1 can be calculated using the following formulas:

$$S[1, n_i, n_j, 1] = \frac{1}{\mathrm{outdeg}(n_i) \cdot \mathrm{outdeg}(n_j)} \qquad (3\text{-}2)$$

$$S[2, n_i, n_j, 1] = \frac{1}{\mathrm{outdeg}(n_i)} \qquad (3\text{-}3)$$

$$S[3, n_i, n_j, 1] = \frac{1}{\mathrm{outdeg}(n_j)} \qquad (3\text{-}4)$$

Analysis shows that the calculation of the matrix $S$ has overlapping sub-problems and optimal substructure, so dynamic programming can be used to optimize it. There are five different state transition situations, as shown in Fig. (2) (right). These five situations reduce to three encounter modes ($n_i, n_j, n_k \in V_{seme}$):

$$S[1, n_i, n_j, l] = \max_{n_k} \big( S[1, n_i, n_k, l-1] \cdot S[2, n_k, n_j, 1],\; S[3, n_i, n_k, l-1] \cdot S[2, n_k, n_j, 1],\; S[3, n_i, n_k, l-1] \cdot S[1, n_k, n_j, 1] \big) \qquad (3\text{-}5)$$

$$S[2, n_i, n_j, l] = \sum_{n_k} S[2, n_i, n_k, l-1] \cdot S[2, n_k, n_j, 1] \qquad (3\text{-}6)$$

$$S[3, n_i, n_j, l] = \sum_{n_k} S[3, n_i, n_k, l-1] \cdot S[3, n_k, n_j, 1] \qquad (3\text{-}7)$$

The time complexity of the state transition formulas above is the square of the number of sememe nodes, which improves efficiency. After adding the walk step limit $t$, we obtain the sememe-only form of $P_{before}(n_a, n_b, t)$ ($n_a, n_b \in V_{seme}$):

$$P_s(1, n_a, n_b, t) = \sum_{d \le t} S[1, n_a, n_b, d]$$
$$P_s(2, n_a, n_b, t) = \sum_{d \le t} S[2, n_a, n_b, d]$$
$$P_s(3, n_a, n_b, t) = \sum_{d \le t} S[3, n_a, n_b, d] \qquad (3\text{-}8)$$

Once $P_{before}(n_a, n_b, t)$ ($n_a, n_b \in V_{seme}$) is obtained, the encounter probability $P_{before}(n_a, n_b, t)$ among senses can be calculated by formula (3-9).
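The dynamic programming over the three encounter modes can be sketched as follows. This is one possible reading of the step=1 base cases and the recurrences for modes 1 to 3, not the authors' implementation: the function names, the `linked` set of sememe pairs that share a sense node, and the choice to give unlinked pairs probability 0 at step 1 are all assumptions made for illustration.

```python
def encounter_tables(outdeg, linked, max_steps):
    """Return S with S[mode][d][i][j] for modes 1..3 and d = 1..max_steps.

    outdeg: out-degree of each sememe node.
    linked: set of (i, j) sememe pairs joined by a step=1 path
            (assumption: all other pairs get probability 0 at step 1).
    """
    n = len(outdeg)

    def zeros():
        return [[0.0] * n for _ in range(n)]

    # Base cases for step = 1.
    S = {mode: {1: zeros()} for mode in (1, 2, 3)}
    for i, j in linked:
        S[1][1][i][j] = 1.0 / (outdeg[i] * outdeg[j])
        S[2][1][i][j] = 1.0 / outdeg[i]
        S[3][1][i][j] = 1.0 / outdeg[j]

    # Each level l reuses only level l-1 and the step=1 tables.
    for l in range(2, max_steps + 1):
        for mode in (1, 2, 3):
            S[mode][l] = zeros()
        for i in range(n):
            for j in range(n):
                S[2][l][i][j] = sum(
                    S[2][l - 1][i][k] * S[2][1][k][j] for k in range(n))
                S[3][l][i][j] = sum(
                    S[3][l - 1][i][k] * S[3][1][k][j] for k in range(n))
                S[1][l][i][j] = max(
                    max(S[1][l - 1][i][k] * S[2][1][k][j],
                        S[3][l - 1][i][k] * S[2][1][k][j],
                        S[3][l - 1][i][k] * S[1][1][k][j])
                    for k in range(n))
    return S


def cumulative(S, mode, i, j, t):
    """Sum the mode's encounter probabilities over all step counts d <= t."""
    return sum(S[mode][d][i][j] for d in range(1, t + 1))


# Toy chain of three sememes: 0 and 1 share a sense, 1 and 2 share a sense.
tables = encounter_tables([2, 2, 2], {(0, 1), (1, 0), (1, 2), (2, 1)}, 2)
```

Because every level is built only from the previous level and the step=1 tables, no walk is ever re-enumerated, which is the overlapping sub-problem structure the text refers to.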

$$P_{before}(n_a, n_b, t) = \max_{(n_a, n_i) \in E,\ (n_b, n_j) \in E} \left( \frac{1}{\mathrm{outdeg}(n_a)} \cdot P_s(m, n_i, n_j, t) \cdot \frac{1}{\mathrm{outdeg}(n_b)} \right), \quad m \in \{1, 2, 3\} \qquad (3\text{-}9)$$

where the maximum is taken over the sememe nodes $n_i$ and $n_j$ adjacent to the sense nodes $n_a$ and $n_b$ and over the encounter modes $m$.

So far, the required encounter probabilities among senses have been obtained; the computational complexity of this algorithm is $O(N^2 \cdot t)$. The semantic relevance between senses can be evaluated using this encounter probability, as follows:

$$rela(n_a, n_b) = P_{before}(n_a, n_b, t) \qquad (3\text{-}10)$$

4. CHINESE WEBPAGE AUTOMATIC SUMMARY SYSTEM DESIGN

4.1. System Main Function Modules

As shown in Fig. (3), the system is made up of six modules: the text logic structure analysis module, the Web document summary module, the text physical structure analysis module, the keyword extraction module based on the sememe graph, the content relevance analysis module, and the automatic summary and post-processing module.

(1) Text logic structure analysis module: recognizes subheads and divides the document into several sections by subhead.
(2) Web document summary module: reads the webpage at an externally supplied URL, renders it using a browser core, and summarizes the page text using a text summary method based on vision analysis.
(3) Text physical structure analysis module: analyzes the physical structure (words, sentences and paragraphs) of the webpage text.
(4) Keyword extraction module based on the sememe graph: extracts keyword vectors from semantic paragraphs.
(5) Content relevance analysis module: calculates the relevance among words and sentences to provide content relevance support for other modules.
(6) Automatic summary and post-processing module: using the algorithm in this paper, extracts summaries from the document and outputs them to the external caller after simple processing.

Fig. (3). System main function modules.

The processing flow of each module in the automatic summary system is shown in Fig. (4).

As shown in Fig. (4), the system is formed by two parts: the webpage extraction module and the automatic summary module. Given an externally supplied URL, the webpage pre-processing module reads, parses and renders the web page and, after Web vision and label analysis and Chinese text characteristics analysis, outputs the topic-related content of the webpage and the webpage text. The topic-related content, with webpage noise filtered out, is output to the search engine module for indexing. The webpage text is output to the automatic summary module for subsequent summary processing.

The automatic summary module is formed by six modules: the semantic knowledge base module, the text pre-processing module, the lexical relevance calculation module, the semantic paragraph division module, the automatic summary and post-processing module, and the Hadoop storage interface. The semantic knowledge base module stores the general vocabulary, the field vocabulary, the CNKI knowledge base and the webpage summary knowledge base to provide support for other modules. Hadoop is responsible for caching intermediate results to improve the speed of automatic summarization.

Fig. (4). System main function module.

4.2. System Design

Dynamic automatic summarization is divided into two sub-processes: automatic summary pre-processing and the dynamic automatic summary process. After the webpage text is divided into semantic paragraphs, the intermediate results are cached on the Hadoop platform through the Hadoop storage interface. When the user inputs search words, the search engine passes two parameters, the webpage ID and the search query, to the automatic summary module. Through the Hadoop storage interface, the automatic summary module retrieves the intermediate results corresponding to this webpage ID and outputs the summary directly after the sentence selection and post-processing module.

Fig. (5). Web page text extraction module process.

As shown in Fig. (5), the webpage text extraction process is as follows:
1) Read the HTML file corresponding to the input URL;
2) Process the HTML labels and render the webpage;
3) Analyze the webpage by its vision tree, including each section's position and area;
4) Segment the words in each section and analyze the text characteristics;
5) Consider the results of the vision tree analysis and the text characteristics analysis together;
6) Extract the topic-related content of the webpage and the webpage text.

As shown in Fig. (6), the automatic summary pre-processing is as follows:
1) Obtain the webpage text from the webpage pre-processing module and calculate the unique ID corresponding to this webpage;
2) Perform word segmentation and part-of-speech tagging;
3) Identify the title and divide sentences, paragraphs and subheads;
4) Divide semantic paragraphs using the text structure analysis algorithm; the lexical relevance calculation module provides support for lexical relevance calculation;
5) Cache the intermediate results of semantic paragraph division on the Hadoop platform through the Hadoop storage interface, keyed by the webpage's unique ID.
The dynamic automatic summary process is as follows:
1) Read the intermediate results through the Hadoop storage interface using the obtained unique webpage ID;
2) Score the sentences using the method that combines ontology with TF-IDF, and select a certain proportion of sentences from each semantic paragraph to form the summary;
3) Post-process the summary and filter out sentences with repeated semantics to improve accuracy and readability.
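The sentence selection in steps 2) and 3) can be illustrated with a simplified sketch. It is not the system's scoring method: the ontology component is left out, plain TF-IDF with a boost for query terms is swapped in, duplicate filtering is reduced to dropping identical sentences, and the names `summarize`, `ratio` and `query_boost` are hypothetical. Sentences are assumed to be already segmented into words.

```python
import math
from collections import Counter


def summarize(paragraphs, query_terms=(), ratio=0.1, query_boost=2.0):
    """Pick the top-scoring sentences from every semantic paragraph.

    paragraphs: list of semantic paragraphs; each paragraph is a list of
    sentences and each sentence is a list of already segmented words.
    """
    sentences = [s for paragraph in paragraphs for s in paragraph]
    n = len(sentences)
    # Document frequency, treating every sentence as one "document".
    doc_freq = Counter(word for s in sentences for word in set(s))
    query = set(query_terms)

    def score(sentence):
        if not sentence:
            return 0.0
        total = 0.0
        for word, count in Counter(sentence).items():
            tf = count / len(sentence)
            idf = math.log(n / doc_freq[word] + 1.0)
            # Assumption: query terms simply weigh more than other words.
            total += tf * idf * (query_boost if word in query else 1.0)
        return total

    summary, seen = [], set()
    for paragraph in paragraphs:
        keep = max(1, round(ratio * len(paragraph)))
        ranked = sorted(range(len(paragraph)),
                        key=lambda i: score(paragraph[i]), reverse=True)
        for i in sorted(ranked[:keep]):  # restore document order
            key = tuple(paragraph[i])
            if key not in seen:          # drop exact repeats only
                seen.add(key)
                summary.append(paragraph[i])
    return summary
```

Selecting per paragraph rather than over the whole document keeps every semantic paragraph represented in the summary, which matches the fixed proportion per paragraph described in step 2).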

Fig. (6). Automatic summary module process.

4.3. System Interface

The interfaces are defined in Table 1.

(1) Get article text interface
Function: get the article text of a single-document webpage; for a webpage that is not a single-document webpage, an empty string is returned.

(2) Get webpage summary interface
Function: get the webpage summary; the length of the output summary does not exceed the maximum length.

(3) Automatic summary pre-processing interface
Function: when the search engine extracts a webpage, it informs the automatic summary module through this interface so that the summary is pre-processed; pre-processed webpages can then call interface 4 for a fast dynamic summary.

(4) Get fast dynamic summary interface
Function: quickly get the dynamic webpage summary. The ID of a pre-processed webpage must be supplied, the length of the output summary does not exceed the maximum length, and an empty string is returned if the webpage corresponding to this ID has not been pre-processed through interface 3.

Interface calling methods:
(1) Web service (JSON/XML)
(2) C# DLL

Table 1. Interface definition.

No.    Interface name        Input parameter 1        Input parameter 2        Output parameter
                             (type, variable name)    (type, variable name)    (type)
(1-1)  Get Article Text      String, URL                                       String
(1-2)  Get Summary           String, URL              String, MaxLength        String
(1-3)  Process Summary       String, URL              String, ID               Int
(1-4)  Get Dynamic Summary   String, ID               String, MaxLength        String

5. EXPERIMENT FORMATION AND RESULT ANALYSIS

This paper evaluates the system by calculating the accuracy rate and recall rate of the automatic summary against an ideal human summary. For the summarization task, what matters is that the meaning is right; strictly checking whether the system summary is identical to the expert summary is too harsh. Moreover, human summaries are rarely unique: the same thing can be described in different ways, and users can produce many different generic summaries or summaries focused on what they consider acceptable. In fact, experiments show that people rarely agree on which sentences or paragraphs should be included in a summary. Even the same expert produces noticeably different summaries of the same article after an interval of time. This paper therefore adopts a new evaluation strategy: the automatic summary is counted as accurate if it is consistent with the sub-topics of the article covered by the human summary. For example, if the human summary and the automatic summary choose two different sentences with close meaning, the choice is treated as accurate. The summaries are evaluated overall using the accuracy rate, recall rate and F value.

We used search engine crawler technology to extract 2000 news pages of different styles as the test corpus. To keep overly long or overly short articles from influencing the evaluation, this paper considers only general news of medium length and chooses web pages with 30 sentences of content for the experiment. For each article, the system first generates an automatic summary with a compression ratio of 0.1, and then 10% of the content is extracted as the ideal human summary. During human extraction, sentences already chosen by the automatic summary system are given priority, and the system summary is then compared with the human summary. The summaries are evaluated in terms of information content, using important indexes from the field of information retrieval: the accuracy rate, the recall rate, and the F value that combines the accuracy rate and the recall rate.

We wrote software to automate the summarization experiments. Over the 2000 articles, the improved algorithm achieves an average accuracy rate of 0.502, an average recall rate of 0.853 and an average F value of 0.727. Part of the experimental results are shown in Table 2.

The experiment shows that the accuracy and recall rates of the improved algorithm are good. The recall rate of the improved algorithm (85.3%) is clearly higher than its accuracy rate, which corresponds to the design purpose of the algorithm in this paper.

6. SEMANTIC RELEVANCE EVALUATION

There are two classic methods for evaluating semantic relevance: one is agreement with human evaluation, and the other is the performance of the algorithm in a specific application. This paper uses both: it compares the algorithm with a set of human evaluation data, and it applies the algorithm in a summarization system and observes the influence on the summaries. Since researchers do not agree on how to quantify semantic relevance, this paper evaluates using Spearman's ρ coefficient, which, when two sets of relevance results are compared, considers only the relative ranking of the relevance values. At present, WordSimilarity-353 (WS-353) is a common human evaluation dataset for English word relevance; its authors invited many participants to score the relevance of word pairs. Because no high-quality Chinese dataset exists, we chose a subset (100 words) of WS-353 and translated the word pairs into Chinese; the result is called CWord-100.
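Because Spearman's ρ depends only on rankings, it can be computed as the Pearson correlation of the rank vectors. The sketch below shows this with the standard library only; the function names `ranks` and `spearman` and the example scores are hypothetical and are not taken from CWord-100.

```python
import math


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[pos]]):
            end += 1
        mean_rank = (pos + end) / 2.0 + 1.0
        for k in range(pos, end + 1):
            result[order[k]] = mean_rank
        pos = end + 1
    return result


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the two rank vectors.

    Assumes both inputs have the same length and are not constant
    (a constant input would make the denominator zero).
    """
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for four word pairs (not data from the paper).
human_scores = [0.9, 0.7, 0.4, 0.1]
model_scores = [0.30, 0.25, 0.26, 0.01]
rho = spearman(human_scores, model_scores)
```

Only the ordering of `model_scores` matters here, so an algorithm whose relevance values live on a different scale from the human ratings is not penalized for that.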
The principle for choosing word pairs is that the English words in the pair can be translated into Chinese words with the same meaning and that accurate and effective sememe descriptions of them exist in CNKI. The authors wrote a lexical relevance calculation software package implementing the improved algorithm. Taking several classic algorithms as comparison baselines, we calculated the relevance values of the word pairs in CWord-100 and compared their ranking with that of the human evaluation; the results are shown in Table 3.

The experiment shows that, compared with the other algorithms, the improved algorithm provides relevance values that correlate more highly with human evaluation. Some classic word pairs are listed in Table 4 for comparison with the popular method of Q. Liu [3].

Table 2. Improved algorithm results.

Improved algorithm   1        2        3        4        5
accuracy rate        0.4 3    0.65 5   0.5 2    0.7      0.6 5
recall rate          0.86 6   0.85 5   0.9 2    0.9 0    0.9 5
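The accuracy rate, recall rate and F value used in the summarization experiment can be sketched as follows, under the sub-topic matching rule described in Section 5. Two assumptions are made for illustration: every summary sentence has already been labelled with the sub-topic it covers, and the F value is the balanced harmonic mean of the two rates, a formula the text does not spell out. The function name `evaluate_summary` is hypothetical.

```python
def evaluate_summary(auto_topics, human_topics):
    """Return (accuracy rate, recall rate, F value) for one article.

    auto_topics / human_topics: one sub-topic label per summary sentence.
    A sentence counts as correct when its sub-topic also appears in the
    other summary, even if the sentences themselves differ.
    """
    auto_set, human_set = set(auto_topics), set(human_topics)
    hits_auto = sum(1 for topic in auto_topics if topic in human_set)
    hits_human = sum(1 for topic in human_topics if topic in auto_set)
    precision = hits_auto / len(auto_topics) if auto_topics else 0.0
    recall = hits_human / len(human_topics) if human_topics else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    # Assumption: balanced F value (harmonic mean of the two rates).
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


# Hypothetical labels: the system picked four sentences, the human two.
result = evaluate_summary(["t1", "t2", "t3", "t4"], ["t1", "t2"])
```

In this toy case every human sub-topic is covered while half of the system's sentences fall outside the human summary, so recall exceeds the accuracy rate, the same direction as the averages reported above.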

Table 3. Comparison experiment among several classic methods (based on the CWord-100 test set).

Model                      CWord-100
Algorithm in this paper    0.803
T. Hughes [7]              0.799
Q. Liu [3]                 0.654
X. Yun [6]                 0.677
Z. Shuqin [8]              0.730

Table 4. Relevance of classic word pairs calculated with the method in this paper and the method of Q. Liu [3].

Word pair   Human evaluation   Q. Liu [3]   Improved algorithm
/           1.000              1.000        1.000
/           0.762              0.005        0.207
/           0.750              0.52         0.507
/           0.746              0.005        0.207
/           0.742              0.444        0.255
/           0.700              0.948        0.270
/           0.652              0.044        0.226
/           0.023              0.7          0.003
/           0.63               1.000        0.500
/           0.69               0.204        0.560
/           0.394              0.722        0.326
/           0.222              0.2          0.00

This automatic text summarization system originally used the relevance evaluation method of Q. Liu, which produced wrong relevance values and reduced the accuracy of the system's summaries. After the method of Q. Liu was replaced with the method in this paper, summary accuracy improved visibly. The results are shown in Table 5.

Table 5. Influence on the accuracy of the automatic text summarization system of using the improved algorithm and the Q. Liu method.

Method used           Degree of accuracy
Improved algorithm    0.705
Q. Liu [3]            0.683

7. SUMMARY

This paper has studied Chinese webpage automatic summarization technology as a whole and presented a lexical semantic relevance algorithm based on CNKI knowledge and computational semantics, which calculates the direct and indirect relevance between sememes using an improved random walk algorithm. Unlike existing relevance measures based on the random walk model, this paper uses the average encounter probability instead of the average arrival probability, which accords with how humans recognize relevance and avoids vector difference measures of unclear meaning. However, some disadvantages remain; for example, the generated summaries are not smooth and fluent. More summary post-processing techniques will therefore be added to improve the readability of the summaries.

CONFLICT OF INTEREST

The authors confirm that this article content has no conflict of interest.
ACKNOWLEDGEMENTS

This research is supported in part by the National Science Foundation of China, No. 637408.

REFERENCES

[1] S. Banerjee and T. Pedersen, Extended gloss overlaps as a measure of semantic relatedness, IJCAI, vol. 3, pp. 805-810, 2003.
[2] E. Gabrilovich and S. Markovitch, Computing semantic relatedness using Wikipedia-based explicit semantic analysis, IJCAI, vol. 7, pp. 1606-1611, 2007.
[3] Q. Liu and S. Li, Word similarity computing based on how-net, Computational Linguistics and Chinese Language Processing, vol. 7, no. 2, pp. 59-76, 2002.
[4] P. Berkhin, A survey on pagerank computing, Internet Mathematics, vol. 2, no. 1, pp. 73-120, 2005.
[5] W. Yi and W. Xiaolin, Algorithm for words semantic relevancy based on modified algorithm for sememes relevancy, Journal of The China Society For Scientific And Technical Information, vol. 3, no. 2, pp. 27-275, 2012.
[6] X. Yun Xu and F. F. Zhang, Semantic relevancy computing based on hownet, Transactions of Beijing Institute of Technology, vol. 25, no. 5, pp. 411-414, 2005.
[7] T. Hughes and D. Ramage, Lexical Semantic Relatedness with Random Graph Walks, EMNLP-CoNLL, 2007, pp. 581-589.
[8] Z. Shuqin and W. Yangyang, The model of words relation computing based on the HowNet, Microcomputer and Its Applications, vol. 3, no. 8, pp. 77-80, 2012.

Received: June 10, 2015    Revised: July 29, 2015    Accepted: August 5, 2015

Wang et al.; Licensee Bentham Open. This is an open access article licensed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted, non-commercial use, distribution and reproduction in any medium, provided the work is properly cited.