Identifying functions of citations with CiTalO

Angelo Di Iorio 1, Andrea Giovanni Nuzzolese 1,2, and Silvio Peroni 1,2
1 Department of Computer Science and Engineering, University of Bologna (Italy)
2 STLab-ISTC Consiglio Nazionale delle Ricerche (Italy)
diiorio@cs.unibo.it, nuzzoles@cs.unibo.it, essepuntato@cs.unibo.it

Abstract. Bibliographic citation is one of the most important activities of an author in the production of any scientific work. The reasons that an author cites other publications are varied: to gain assistance of some sort, to review, critique or refute previous works, etc. In this paper we propose a tool, called CiTalO, to automatically infer the nature of citations by means of Semantic Web technologies and NLP techniques. Such a characterisation makes citations more effective for linking, disseminating, exploring and evaluating research.

1 Introduction

Bibliographic citations are the most widely used tools of academic communities for linking research, for instance by connecting scientific papers to related works or sources of experimental data. Citations are also tools for disseminating research, as discussed at length in [9], and for exploring it, for instance by providing new interfaces for browsing data. Finally, citations are useful for evaluating research, e.g. through bibliometric measures such as h-index and impact factor.

All these activities can be radically improved by exploiting the actual nature of citations, i.e. the author's reason for citing a given paper [11]. The mere existence of a citation, in fact, does not provide any information about the reasons the author had in mind when creating that citation to some particular document rather than to another. It is the characterisation of a citation that really captures its meaning and effect.

The goal of this paper is to present CiTalO, a tool that automatically annotates citations with properties defined in CiTO (Citation Typing Ontology, http://purl.org/spar/cito) [7]. These properties describe the nature of citations in scholarly works. CiTalO is implemented in Java and can be used either as a stand-alone component or as a web service. A demo version is also available at http://wit.istc.cnr.it:8080/tools/citalo: users can submit an English sentence containing a citation through a simple HTML form and receive the list of CiTO properties that characterise the nature of that citation. Multiple configurations can also be tested with the same prototype.

CiTalO exploits Semantic Web technologies and NLP techniques to produce the output. The tool is designed as a chain of analysers that (i) produce ontological statements from texts, (ii) search for patterns in those statements, (iii) map those patterns into linguistic resources and (iv) use these resources to produce a final characterisation that conforms to CiTO. The chain also includes a sentiment-analysis module to refine results.

The paper is structured as follows. In Section 2 we introduce previous works on the classification of citations. In Section 3 we describe CiTalO and its structure. In Section 4 we conclude the paper and sketch out some future works.

2 Related works

In [3] Copestake et al. introduce the SciBorg framework, which includes a module for discourse and citation analysis that follows the Argumentative Zoning scheme proposed by Teufel et al. [10] and produces quite good results.

Teufel et al. present a study of citation function [11]. They provide a categorisation of possible citation functions organised in twelve classes, in turn clustered into Negative, Neutral and Positive rhetorical functions. They also performed some tests on hundreds of articles in computational linguistics, evaluating the output of several human annotators and of a novel machine learning approach, and showed that the agreement between humans is actually higher than the agreement between humans and automatic analysis.

Along the same lines, Jorg analysed several documents from the ACL Anthology Network (http://clair.eecs.umich.edu/aan/index.php) in order to identify the verbs that usually carry important information about the nature of citations [6].

Closely related to the annotation of citation functions, in [2] Athar et al. propose and evaluate (with good results) a sentiment-analysis approach to citations, so as to identify whether a particular act of citing was done with positive intentions (e.g. praising a previous work on a certain topic) or negative ones (e.g. criticising the results obtained through a particular method).

3 CiTalO

CiTalO tries to infer the function of citations by combining techniques of ontology learning from natural language, sentiment analysis, word-sense disambiguation, and ontology mapping. These techniques are meant to be applied in a pipeline whose input is the sentence of an article that contains the citation (e.g. "It extends the research outlined in earlier work X", where X is a reference to a particular bibliographic entity) and whose output is one or more properties of the CiTO ontology [7] (cito:extends for the previous example). The overall architecture is shown in Fig. 1, while an extensive explanation of features and drawbacks of CiTalO can be found in [4].

Fig. 1. The pipeline used by CiTalO. The input is the textual context in which the citation appears and the output is a set of properties of CiTO.

Sentiment-analysis for gathering the polarity of citation functions. The aim of this step is to capture the sentiment polarity emerging from the text in which the citation is included. This is connected to the classification of CiTO properties provided in [7], where the semantics of rhetorical citations is expressed according to three different polarities, i.e. positive, neutral and negative. Being able to recognise the polarity behind the citation would restrict the set of candidate CiTO properties to match. Note also that this analysis runs in parallel with the other steps in CiTalO, since it acts as a refinement filter on the results. The current sentiment-analysis component is based on AlchemyAPI (http://www.alchemyapi.com), but it can easily be replaced with other similar tools.

Ontology extraction from the textual context of the citation. The first mandatory step of CiTalO consists of deriving a logical representation of the sentence containing the citation. This ontology extraction is performed by using FRED [8], a tool for ontology learning based on discourse representation theory, frames and ontology design patterns. Transforming the sentence into a logical form allows us to recognise graph patterns that reveal possible types of rhetorical denotation of the citation. Consider, for instance, the sentence "It extends the research outlined in earlier work X", where X is the cited work. The graphical representation of the FRED output, which is also available as RDF statements, is presented in Fig. 2.

Citation type extraction through pattern matching. The second step consists of extracting candidate types for the citation by looking for patterns in the FRED result. We designed several graph-pattern-based heuristics following criteria similar to those of lexico-syntactic patterns [1], extended with the exploitation of RDF graph topology and OWL semantics. These heuristics are implemented as SPARQL queries, and some examples are shown below:

  SELECT ?type WHERE { ?subj ?prop fred:X . ?subj a ?type }
  SELECT ?type WHERE { ?subj ?prop fred:X . ?subj a ?typetmp . ?typetmp rdfs:subClassOf+ ?type }
  SELECT ?type WHERE { ?subj a dul:Event . ?subj a ?type . FILTER (?type != dul:Event) }
  SELECT ?type WHERE { ?subj a dul:Event . ?subj a ?typetmp . ?typetmp rdfs:subClassOf+ ?type . FILTER (?type != dul:Event) }
  SELECT ?type WHERE { ?subj a dul:Event . ?subj boxer:patient ?patient . ?patient a ?type }
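To make the five query shapes more concrete, the sketch below emulates them in plain Python over a small hand-written set of triples. It is only an illustration under stated assumptions: the triples, the node names (fred:extend_1, fred:work_1, and so on) and the predicate toy:refersTo are hypothetical and are not FRED's actual output, and simple set comprehensions stand in for a real SPARQL engine.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"
EVENT = "dul:Event"

# Hypothetical toy graph (NOT real FRED output) for
# "It extends the research outlined in earlier work X".
TRIPLES = {
    ("fred:extend_1", RDF_TYPE, "fred:Extend"),
    ("fred:extend_1", RDF_TYPE, EVENT),
    ("fred:extend_1", "boxer:patient", "fred:research_1"),
    ("fred:outline_1", RDF_TYPE, "fred:Outline"),
    ("fred:outline_1", RDF_TYPE, EVENT),
    ("fred:outline_1", "boxer:patient", "fred:research_1"),
    ("fred:research_1", RDF_TYPE, "fred:Research"),
    ("fred:work_1", RDF_TYPE, "fred:EarlierWork"),
    ("fred:work_1", "toy:refersTo", "fred:X"),  # assumed link to the cited work
    ("fred:EarlierWork", SUBCLASS, "fred:Work"),
}


def types_of(triples, subject):
    """All objects of `subject rdf:type ?o`."""
    return {o for s, p, o in triples if s == subject and p == RDF_TYPE}


def superclasses(triples, cls):
    """Transitive closure of rdfs:subClassOf (the '+' property path)."""
    found, frontier = set(), {cls}
    while frontier:
        nxt = {o for s, p, o in triples if p == SUBCLASS and s in frontier}
        frontier = nxt - found
        found |= nxt
    return found


def candidate_types(triples, cited="fred:X"):
    candidates = set()
    # Queries 1 and 2: types (and their superclasses) of nodes linked to the cited work.
    for subj in {s for s, p, o in triples if o == cited}:
        for t in types_of(triples, subj):
            candidates |= {t} | superclasses(triples, t)
    events = {s for s, p, o in triples if p == RDF_TYPE and o == EVENT}
    for ev in events:
        # Queries 3 and 4: event types and their superclasses, excluding dul:Event.
        for t in types_of(triples, ev):
            candidates |= ({t} | superclasses(triples, t)) - {EVENT}
        # Query 5: types of the entities playing the patient role in the event.
        for s, p, o in triples:
            if s == ev and p == "boxer:patient":
                candidates |= types_of(triples, o)
    return candidates


print(sorted(candidate_types(TRIPLES)))
```

The toy graph was written so that the result contains the five candidate types the paper reports for this sentence; on a real FRED graph the same queries would normally be run through a SPARQL engine instead.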

Fig. 2. FRED result for "It extends the research outlined in earlier work X".

The intended semantics of the above patterns is to select from the RDF graph all the types, and any taxonomies they belong to, related to (i) the cited document, (ii) the events recognised in the citation, and (iii) the entities affected by those events (i.e. the entities playing the VerbNet role of patient). Applying these patterns to the graph shown in Fig. 2, the following candidate types are found: Outline, Extend, EarlierWork, Work, and Research. The current set of heuristics is quite simple and incomplete, but we are continuously updating the catalogue by investigating new heuristics.

Word-sense disambiguation. The next step consists of disambiguating the sense of each candidate type. This can be done through word-sense disambiguation services and APIs; in CiTalO we use IMS [12]. The disambiguation is performed with respect to OntoWordNet [5] and produces a list of synsets for the candidate types. Going back to the example, this phase would produce the following list (the prefix own stands for http://www.w3.org/2006/03/wn/wn30/instances/): (i) Extend is disambiguated as own:synset-prolong-verb-1, (ii) Outline as own:synset-delineate-verb-3, (iii) Research as own:synset-research-noun-1, (iv) EarlierWork and Work as own:synset-work-noun-1.

Alignment to CiTO. The last step consists of associating each synset with a CiTO property and refining the results by using citation polarities and factual characterisation. We use two ontologies for this purpose: CiTO2Wordnet and CiTOFunctions. CiTO2Wordnet (http://www.essepuntato.it/2013/03/cito2wordnet) maps all the CiTO properties defining citations to the appropriate WordNet synsets [5]. CiTOFunctions (http://www.essepuntato.it/2013/03/cito-functions) classifies each CiTO property according to its factual and rhetorical functions [7]. The final alignment to CiTO is performed by means of a SPARQL CONSTRUCT query that uses the enhanced RDF graph obtained during the pipeline, the RDF graph of the polarity, OntoWordNet and the two ontologies just described.
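The alignment step can be pictured with the following minimal sketch, which replaces the SPARQL CONSTRUCT query with plain dictionary lookups. Everything in the two tables is an assumption made for illustration: the pairing of own:synset-prolong-verb-1 with cito:extends merely mirrors the running example, the synset own:synset-criticize-verb-1 is a hypothetical name, and the polarity labels are placeholders rather than the actual contents of CiTO2Wordnet or CiTOFunctions.

```python
# Hypothetical excerpt of a synset-to-property mapping (stand-in for CiTO2Wordnet).
SYNSET_TO_CITO = {
    "own:synset-prolong-verb-1": "cito:extends",
    "own:synset-criticize-verb-1": "cito:critiques",  # hypothetical synset name
}

# Hypothetical polarity labels (stand-in for CiTOFunctions).
CITO_POLARITY = {
    "cito:extends": "positive",
    "cito:critiques": "negative",
}


def align(synsets, polarity=None):
    """Map synsets to CiTO properties, optionally keeping only one polarity."""
    properties = []
    for synset in synsets:
        prop = SYNSET_TO_CITO.get(synset)
        if prop is None or prop in properties:
            continue  # no mapping for this synset, or property already selected
        if polarity is None or CITO_POLARITY.get(prop) == polarity:
            properties.append(prop)
    return properties


# Synsets produced by the disambiguation step for the running example.
example_synsets = [
    "own:synset-prolong-verb-1",
    "own:synset-delineate-verb-3",
    "own:synset-research-noun-1",
    "own:synset-work-noun-1",
]
print(align(example_synsets, polarity="positive"))
```

Under these assumed tables only the synset for Extend has a mapping, so the example sentence is aligned to cito:extends, and a detected polarity that disagrees with the label of a property would filter that property out.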

4 Conclusions

CiTalO integrates Semantic Web technologies and NLP techniques to extract information about the nature, the motivations and the goals of each citation. The CiTalO architecture is composed of a pipeline of modules that map documents into ontological data, ontological data into linguistic resources and, finally, linguistic resources into CiTO properties.

The implementation is still at an early stage. On the other hand, the overall approach is very open to incremental refinements. We are currently working to improve the pattern-matching phases in CiTalO and to include a mechanism for the automatic identification of the textual context of citations given an input article. We also plan to perform exhaustive tests with a large set of documents and users.

References

1. Aguado de Cea, G., Gómez-Pérez, A., Montiel-Ponsoda, E., Suárez-Figueroa, M. C. (2008). Natural Language-Based Approach for Helping in the Reuse of Ontology Design Patterns. In Proceedings of EKAW 2008: 32-47. DOI: 10.1007/978-3-540-87696-0_6
2. Athar, A., Teufel, S. (2012). Context-Enhanced Citation Sentiment Detection. In Proceedings of HLT-NAACL 2012: 597-601.
3. Copestake, A., Corbett, P., Murray-Rust, P., Rupp, C. J., Siddharthan, A., Teufel, S., Waldron, B. (2006). An architecture for language processing for scientific text. In Proceedings of the UK e-Science All Hands Meeting 2006.
4. Di Iorio, A., Nuzzolese, A. G., Peroni, S. (2013). Towards the automatic identification of the nature of citations. To appear in Proceedings of SePublica 2013.
5. Gangemi, A., Navigli, R., Velardi, P. (2003). The OntoWordNet Project: Extension and Axiomatization of Conceptual Relations in WordNet. In Proceedings of CoopIS/DOA/ODBASE 2003: 820-838. DOI: 10.1007/978-3-540-39964-3_52
6. Jorg, B. (2008). Towards the Nature of Citations. In Poster Proceedings of FOIS 2008.
7. Peroni, S., Shotton, D. (2012). FaBiO and CiTO: ontologies for describing bibliographic resources and citations. In Journal of Web Semantics, 17 (December 2012): 33-43. DOI: 10.1016/j.websem.2012.08.001
8. Presutti, V., Draicchio, F., Gangemi, A. (2012). Knowledge extraction based on discourse representation theory and linguistic frames. In Proceedings of EKAW 2012: 114-129. DOI: 10.1007/978-3-642-33876-2_12
9. Shotton, D. (2009). Semantic publishing: the coming revolution in scientific journal publishing. In Learned Publishing, 22 (2): 85-94. DOI: 10.1087/2009202
10. Teufel, S., Carletta, J., Moens, M. (1999). An annotation scheme for discourse-level argumentation in research articles. In Proceedings of the 9th Conference of the EACL 1999: 110-117.
11. Teufel, S., Siddharthan, A., Tidhar, D. (2006). Automatic classification of citation function. In Proceedings of EMNLP 2006: 103-110.
12. Zhong, Z., Ng, H. T. (2010). It Makes Sense: A wide-coverage word sense disambiguation system for free text. In Proceedings of ACL 2010, System Demonstrations: 78-83.