Citation Proximity Analysis (CPA) - A new approach for identifying related work based on Co-Citation Analysis

Bela Gipp and Joeran Beel. Citation Proximity Analysis (CPA) - A new approach for identifying related work based on Co-Citation Analysis. In Birger Larsen and Jacqueline Leta, editors, Proceedings of the 12th International Conference on Scientometrics and Informetrics (ISSI '09), volume 2, pages 571-575, Rio de Janeiro (Brazil), July 2009. International Society for Scientometrics and Informetrics. ISSN 2175-1935. Downloaded from www.sciplore.org.

Bela Gipp (Bela@Gipp.com) and Joeran Beel (J@Beel.org)
Otto-von-Guericke University, Dept. of Computer Science, Magdeburg, Germany

Abstract

This paper presents an approach for identifying similar documents that can be used to assist scientists in finding related work. The approach, called Citation Proximity Analysis (CPA), is a further development of co-citation analysis that additionally considers the proximity of citations to each other within an article's full text. The underlying idea is that the closer citations are to each other, the more likely it is that they are related. In comparison to existing approaches, such as bibliographic coupling, co-citation analysis or keyword-based approaches, the advantages of CPA are higher precision and the possibility of identifying related sections within documents. Moreover, CPA allows a more precise automatic document classification. CPA is used as the primary approach to analyse the similarity of, and to classify, the 1.2 million publications contained in the research paper recommender system Scienstein.org.

Introduction and Motivation

The search for related scientific work can be tedious, and important documents are often missed. Difficulties arise from the increasing number of publications (growing exponentially at a yearly rate of 3.7%), unclear nomenclature, synonyms and numerous other factors [1].
In practice, most searches for related work start with a few initial papers and proceed by navigating the citation web closest to those papers. However, even the more advanced approaches for identifying related work, based on co-word analysis, collaborative filtering, Subject-Action-Object (SAO) structures or citation analysis, often do not deliver satisfying results [2-8]. Therefore, we developed a new approach to determine the similarity of documents, which we call Citation Proximity Analysis (CPA). The approach is based on co-citation analysis and improves precision by considering the position of citations. It was developed for the research paper recommender Scienstein (see footnote 1) to assist researchers in finding related work.

The first part of this paper gives an overview of existing methods for identifying similar documents, with a focus on the most popular citation analysis approaches and their strengths and weaknesses. The second part explains how CPA can be used to measure similarity and describes the steps necessary to calculate a new metric that we call the Citation Proximity Index (CPI). Afterwards, first results from an empirical study comparing the performance of co-citation analysis and CPA are presented. Finally, an outlook is given on further implications and on how CPA could be used in other fields.

Footnote 1: www.scienstein.org is a research paper recommender, developed by the authors, that focuses on identifying related work.

Related Work

Various approaches exist to determine the degree of similarity of documents in order to identify related work. Whereas text-mining approaches are used in cases in which references are not stated, citation analysis approaches usually deliver superior results because, for example, synonyms and unclear nomenclature do not distort the results [3, 4, 5]. Many citation analysis approaches exist, and each has its own strengths and weaknesses for identifying similar documents. Among the most widely used are the easily applicable "cited by" approach, which considers papers relevant if they cite the same input document, and the "reference list" approach, which considers papers relevant if they were referenced by the input document. The best results can usually be obtained by bibliographic coupling and co-citation analysis, which allow the coupling strength to be calculated [6]. These approaches, which were invented as early as the 1960s and 1970s, are used by scientists and on academic search engine websites like CiteSeer (see footnote 2) [9].

Two documents are bibliographically coupled if they cite one or more documents in common. Figure 1 illustrates this approach: Papers A and B are related because they both cite papers C, D and E.

[Figure 1: Bibliographic coupling. Papers A and B both cite papers C, D and E.]

In contrast, two documents are co-cited when at least one paper cites both. This approach is illustrated in Figure 2: Papers A and B are related because they are both cited by papers C, D and E. The more co-citations two papers receive, the more related they are [6].

[Figure 2: Co-citation analysis. Papers A and B are both cited by papers C, D and E.]

Although both approaches are suitable for identifying similar papers, they serve different purposes. Whereas bibliographic coupling is retrospective, co-citation is essentially a forward-looking perspective [9]. However, both approaches often deliver unsatisfying results, since they only make use of the bibliography at the end of the document without analysing the constellation of citations.
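The two measures described above can be illustrated with a minimal Python sketch. It assumes the citation data is available as a mapping from each citing paper to the list of papers it references; the function names `coupling_strength` and `cocitation_counts` are hypothetical and not taken from the paper or from Scienstein.

```python
from collections import defaultdict
from itertools import combinations


def coupling_strength(references, a, b):
    """Bibliographic coupling: number of references that papers a and b share."""
    return len(set(references.get(a, ())) & set(references.get(b, ())))


def cocitation_counts(references):
    """Co-citation: for each pair of cited papers, count the papers citing both."""
    counts = defaultdict(int)
    for cited in references.values():
        # Each citing paper contributes one co-citation per pair it references.
        for x, y in combinations(sorted(set(cited)), 2):
            counts[(x, y)] += 1
    return dict(counts)


# Figure 1: A and B both cite C, D and E.
figure_1 = {"A": ["C", "D", "E"], "B": ["C", "D", "E"]}
# Figure 2: A and B are both cited by C, D and E.
figure_2 = {"C": ["A", "B"], "D": ["A", "B"], "E": ["A", "B"]}

print(coupling_strength(figure_1, "A", "B"))
print(cocitation_counts(figure_2))
```

Both functions look only at reference lists, which is exactly the information that the bibliography alone provides; the position of a citation in the text plays no role.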
Because relying on the bibliography alone is inherent to both bibliographic coupling and co-citation analysis, it is also not possible to determine in which part of a related document the content of interest can be found.

Citation Proximity Analysis (CPA)

Instead of just using the bibliography, CPA uses the information derived from the proximity of the citations to each other in the full text to calculate the Citation Proximity Index (CPI) in three steps.

1. The document is parsed, and a series of heuristics is used to process the citations, including their position within the document (see footnote 3).

Footnote 2: http://eer.ist.psu.edu
Footnote 3: The citations were parsed using a modified version of ParsCit (http://wing.comp.nus.edu.sg/parscit) in combination with software developed specifically for this purpose, which is available upon request from the authors.

2. The citations are assigned to their corresponding items in the bibliography. The overall error rate of the system we have developed is nearly three percent for the first and second steps.

3. The proximity of each citation pair is examined. The underlying assumption is that the closer the citations are to each other, the more likely it is that they are related. Based on this proximity analysis, the CPI is calculated. If, for example, two citations are given in the same sentence, the probability that they are very similar is higher (CPI = 1) than if they were only in the same paragraph (CPI = 1/2). See Figure 3.

[Figure 3: Illustration of Citation Proximity Analysis. A citing document contains in-text citations [1], [2] and [3] to Document 1, Document 2 and Document 3; citation pairs are annotated with CPI = 1 and CPI = 1/4.]

However, further research is needed to identify the appropriate weighting of the CPI values according to their occurrence, which also seems to depend on the publication's research field and research type. For example, different weightings seem suitable for analysing a technical report or a patent specification. First empirical evaluations have led to the values shown in Table 1 for calculating the CPI.
Table 1: CPI Values

Occurrence                            CPI value
Sentence                              1
Paragraph                             1/2
Chapter                               1/4
Same journal / same book              1/8
Same journal but different edition    1/16

The results delivered by CPA can be improved by evaluating as many sources as possible. Multiple sources are available when the same citation occurs several times and when several documents cite a given document. In our series of tests we obtained the best results by calculating the weighted average of the CPIs. By automating the process described above, we have calculated the CPI for publications contained in the Scienstein database. The results show that, in comparison to co-citation analysis, CPA delivers considerably better results in identifying similar documents.

Empirical Comparison of Co-Citation Analysis and CPA

In the following, first results of a study examining the suitability of CPA for identifying related work are presented. The complete study will be published separately. As it would be unfeasible to compare the results with every known approach, the focus was on a comparison with co-citation analysis, as this approach usually delivers the best results. The 21 study participants were asked to select three similar documents from the Scienstein.org database and were then given six related-work recommendations. Three of these were chosen based on co-citation strength and three based on CPA, without indicating which approach had been used. The results show that CPA performs significantly better in identifying related work than the commonly used co-citation analysis.
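Returning to the CPI calculation, the third step and the aggregation over multiple occurrences can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: citation positions are assumed to be given as (chapter, paragraph, sentence) tuples, only the sentence, paragraph and chapter levels of Table 1 are covered, pairs that share no chapter are simply ignored, and the weights default to 1 because the paper does not specify them. All function names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import combinations

# CPI values from Table 1 for the three within-document levels.
CPI_VALUES = {
    "sentence": Fraction(1),
    "paragraph": Fraction(1, 2),
    "chapter": Fraction(1, 4),
}


def pair_cpi(pos_a, pos_b):
    """CPI for two citation positions given as (chapter, paragraph, sentence)."""
    if pos_a == pos_b:
        return CPI_VALUES["sentence"]
    if pos_a[:2] == pos_b[:2]:
        return CPI_VALUES["paragraph"]
    if pos_a[0] == pos_b[0]:
        return CPI_VALUES["chapter"]
    return Fraction(0)  # assumption: pairs sharing no chapter are ignored


def document_pair_cpis(citations):
    """Collect one CPI per co-occurrence of two different references.

    citations: list of (reference_id, (chapter, paragraph, sentence)).
    """
    pair_values = defaultdict(list)
    for (ref_a, pos_a), (ref_b, pos_b) in combinations(citations, 2):
        if ref_a == ref_b:
            continue
        cpi = pair_cpi(pos_a, pos_b)
        if cpi > 0:
            pair_values[tuple(sorted((ref_a, ref_b)))].append(cpi)
    return dict(pair_values)


def weighted_average(values, weights=None):
    """Weighted average of CPIs; equal weights are an assumption of this sketch."""
    if weights is None:
        weights = [1] * len(values)
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


citations = [
    ("[1]", (1, 1, 1)),
    ("[2]", (1, 1, 1)),  # same sentence as [1]
    ("[3]", (1, 1, 3)),  # same paragraph as [1] and [2]
    ("[2]", (1, 2, 1)),  # [2] cited again in another paragraph
]
pairs = document_pair_cpis(citations)
for pair, values in sorted(pairs.items()):
    print(pair, weighted_average(values))
```

Using exact fractions keeps the values identical to those in Table 1. Extending the sketch to several citing documents would amount to merging the per-document lists before averaging.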

[Figure 4: Comparison of CPA and co-citation analysis (pie chart).]

As the pie chart indicates, nearly twice as many study participants obtained more suitable documents when CPA was used than when co-citation analysis was used. Not surprisingly, the study also substantiated the assumption that CPA delivers superior results especially for documents with an extensive bibliography or documents that have not been referenced frequently. Considering that CPA essentially works like co-citation analysis, with the distinctive difference that the proximity of citations is analysed and additional information about relatedness is therefore gathered, it is not surprising that CPA outperforms co-citation analysis in every examined scenario (see footnote 4).

Outlook & Conclusion

Besides identifying related work, the authors currently apply the idea behind CPA to automatic document classification for the research paper recommender Scienstein [11]. The aim is to automatically analyse the topics within documents by analysing the distribution of references within research papers. So instead of knowing only that a certain publication focuses on, for instance, the theory of relativity, CPA makes it possible to identify the document sections focusing on, for example, time dilation, length contraction or mass-energy equivalence, and then to give specific recommendations within documents or books.

Moreover, it is possible to combine CPA with text mining algorithms in order to automatically detect, for example, contradicting studies. Consider a sentence such as: "The author A has shown in his recent study [reference A] that in contrast to a previous study [reference B]..." By analysing the words between two references, it is often possible to determine automatically how these two references relate to each other. Knowing the position of each citation within a document also often makes it possible to draw conclusions about the document type, e.g. state-of-the-art publications.
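The idea of analysing the words between two references can be sketched as a simple cue-phrase lookup. The paper does not describe a concrete algorithm, so everything here is an assumption made for illustration: the cue list, the bracket-based citation pattern and the function name `relations_between_citations` are hypothetical.

```python
import re

# Hypothetical cue phrases that hint at a contrasting relationship.
CONTRAST_CUES = ("in contrast to", "contrary to", "unlike", "contradict")

# Assumption: citations appear as bracketed markers such as [1] or [reference A].
CITATION = re.compile(r"\[([^\]]+)\]")


def relations_between_citations(text):
    """Label each pair of consecutive citations by the words between them."""
    matches = list(CITATION.finditer(text))
    results = []
    for left, right in zip(matches, matches[1:]):
        between = text[left.end():right.start()].lower()
        if any(cue in between for cue in CONTRAST_CUES):
            label = "contrast"
        else:
            label = "unknown"
        results.append((left.group(1), right.group(1), label))
    return results


sentence = (
    "The author A has shown in his recent study [reference A] that "
    "in contrast to a previous study [reference B]..."
)
print(relations_between_citations(sentence))
```

A realistic system would need a far richer cue vocabulary and some handling of negation; the sketch only shows how the position of each citation gives direct access to the connecting text.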
The information gained can be used to classify further documents and to develop a more sophisticated Web of Science (see footnote 5). We believe that these technologies, in combination with collaborative filtering, will be the future of identifying related work and will open the doors for powerful research paper recommender systems.

Footnote 4: A detailed description of the study and its results will be published separately.
Footnote 5: http://www.garfield.library.upenn.edu/papers/mapsciworld.html

As shown, CPA offers substantial advantages in identifying related documents in comparison to existing approaches. However, it should also be taken into account that the effort required to perform CPA is considerable. It is not sufficient to evaluate the bibliography of documents; it is necessary to process the complete document, identify each reference and map it to the corresponding entry in the bibliography. In practice this is not always possible and leads to mismatches in about 3% of cases. This is because sometimes only an

abstract and the bibliography can be accessed, documents cannot be parsed because OCR (see footnote 6) fails, or a reference style is used that makes it unfeasible to automatically link references to the corresponding items in the bibliography. This leads to the conclusion that although CPA delivers superior results, it cannot completely replace co-citation analysis.

References

[1] May, R. M. (1997). The Scientific Wealth of Nations. Science, 275(5301), 793-796.
[2] Rip, A., & Courtial, J. (1984). Co-Word Maps of Biotechnology: An Example of Cognitive Scientometrics. Scientometrics, 6(6), 381-400.
[3] Fano, R. M. (1956). Information theory and the retrieval of recorded information. In J. H. Shera, A. Kent, & J. W. Perry (Eds.), Documentation in Action (pp. 238-244). New York: Reinhold Publ. Co.
[4] Marshakova, I. V. (1973). System of document connections based on references. Nauchno-Tekhnicheskaya Informatsiya, 2(6), 3-8.
[5] Beel, J., & Gipp, B. (2008). The Potential of Collaborative Document Evaluation for Science. 11th International Conference on Digital Asian Libraries (ICADL 2008), December 2-5, Kuta, Indonesia. In G. Buchanan, M. Masoodian, & S. Cunningham (Eds.), Digital Libraries: Universal and Ubiquitous Access to Information, Lecture Notes in Computer Science, vol. 5362 (pp. 375-378). Springer-Verlag Berlin Heidelberg. DOI 10.1007/978-3-540-89533-6, ISSN 0302-9743.
[6] Small, H. (1973). Co-citation in the scientific literature: a new measure of the relationship between two documents. Journal of the American Society for Information Science, 24, 265-269.
[7] Klavans, R., & Boyack, K. (2006). Identifying a better measure of relatedness for mapping science. Journal of the American Society for Information Science and Technology, 57(2), 251-263.
[8] Sternitzke, C., & Bergmann, I. (2009). Similarity measures for document mapping: A comparative study on the level of an individual scientist. Scientometrics, 78(1), 113-130.
[9] Garfield, E. (2001, November 27). From Bibliographic Coupling to Co-Citation Analysis Via Algorithmic Historio-Bibliography: A Citationist's Tribute to Belver C. Griffith. Paper presented at Drexel University, Philadelphia, PA.
[10] Giles, C. L., Bollacker, K. D., & Lawrence, S. (1998). CiteSeer: an automatic citation indexing system. In Digital Libraries 98 - The Third ACM Conference on Digital Libraries (pp. 89-98).
[11] Gipp, B., Beel, J., & Hentschel, C. (2009). Scienstein - A Research Paper Recommender System. In Proceedings of IEEE International Conference on Emerging Trends in Computing. Tamil Nadu, India.

Footnote 6: Optical Character Recognition