Identifying Related Work and Plagiarism by Citation Analysis


Published in: Bulletin of IEEE Technical Committee on Digital Libraries ; 7 (2011), 1
Konstanzer Online-Publikations-System (KOPS) URL: http://nbn-resolving.de/urn:nbn:de:bsz:352-0-285689

Bela Gipp
OvGU, Germany / UC Berkeley, California, USA
gipp@berkeley.edu

Abstract

This updated and revised paper gives an overview of my PhD research. It focuses on two newly developed approaches. Citation Proximity Analysis (CPA) allows the identification of related work by analyzing the co-occurrence of citations within documents. In contrast to co-citation analysis, various factors, such as the proximity of citations to each other, are taken into account. The second approach is called Citation based Plagiarism Detection (CbPD). Compared with the text-based plagiarism detection approaches currently in use, this citation-analyzing approach achieves a better detection rate for plagiarism forms such as paraphrasing, translations and idea plagiarism.

Keywords: Document Similarity, Relatedness, Clustering, Plagiarism Detection, Duplicate Detection, Citation Analysis, Citation Proximity Analysis, Citation Order Analysis, Language Independent

1 Introduction & Motivation

The search for related work is so time-consuming that, even when performed by experienced scientists, it often leads to unsatisfying results. To alleviate the problem, search engines such as Google Scholar and Citeseer offer to display related documents. The best results are usually achieved by hybrid research paper recommender systems, which give recommendations by combining techniques such as citation analysis, co-word analysis, collaborative filtering, and Subject-Action-Object (SAO) structures. However, these approaches are only suitable to a limited extent for identifying related work [1, 10, 2, 8, 11, 6, 13]. According to our examinations, the best results for scientific documents can usually be achieved by applying the citation-based approaches of bibliographic coupling and co-citation analysis.

The aim is to develop new citation-based approaches in order to identify related documents and plagiarism. So far, two new approaches have been developed, called Citation Proximity Analysis (CPA) and Citation based Plagiarism Detection (CbPD). CPA is a further development of co-citation analysis, whereas CbPD is based on bibliographic coupling but additionally analyzes the order of citations.

Figure 1: GUI SciPlore clustering related documents

In the research paper recommender system SciPlore.org, these approaches are mainly used for two purposes: first, to identify related documents, as shown in Figure 1; and second, to give recommendations for related documents based on one or more documents the user has been interested in, as shown in Figure 2.

Figure 2: Recommendation of related papers (screenshot of a recommendation list based on document usage mining, grouped into "Papers similar to the last papers you have read" and "Papers recently published by authors you have read")

Throughout this document two types of semantic relatedness are distinguished. I adopt the perspective of Resnik and consider similarity as a special case of semantic relatedness [9]. Two documents, for instance, are related if they address the same research question. Two documents are related and similar if they are, for instance, duplicates, plagiarized or translated.

In the first part of this paper, related work is presented and currently applied citation analysis approaches are discussed. In the next section the research design is presented. Afterwards, CPA and CbPD are introduced and compared with regard to their suitability for Academic Recommender Systems. The paper concludes with a summary and an outlook.

2 Proposed Research & Related Work

The usefulness of a research paper recommender system depends to a large extent on its ability to automatically determine documents related to one or more given documents.

Various approaches exist to measure the degree of relatedness in order to identify related work. Whereas text-mining approaches are used in cases in which references are not stated, citation analysis approaches usually deliver superior results because, for example, synonyms and unclear nomenclature do not distort the results [1, 2, 8]. Many citation analysis approaches exist, each with its own strengths and weaknesses for identifying related documents. Among the most widely used are the easily applicable "cited by" approach, which treats papers that cite the input document as relevant, and the "reference list" approach, which treats papers referenced by the input document as relevant. Better results can usually be obtained by bibliographic coupling and co-citation analysis, which allow calculating the coupling strength [11]. These approaches, which were invented in the 1960s and 1970s, are used by scientists and by academic search engines like CiteSeer (see footnote 1) [3].

Figure 3: Bibliographic coupling (left) and Co-citation (right)

Documents are bibliographically coupled if they cite one or more documents in common. Figure 3 (left) illustrates this approach: Papers A and B are related because they both cite papers 1, 2 and 3. In contrast, two documents are co-cited when at least one paper cites both. This approach is illustrated in Figure 3 on the right: Papers A and B are related because they are both cited by papers 1, 2 and 3. The more co-citations two papers receive, the more related they are [11]. Although both approaches are suitable for identifying related papers, they serve different purposes. Whereas bibliographic coupling is retrospective, co-citation is essentially a forward-looking perspective [3]. However, both approaches often deliver unsatisfying results, since they only make use of the bibliography at the end of the document without analyzing the constellation of citations. Therefore, it is not possible to determine in which part of a related document the content of interest can be found.
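The two classical measures described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy corpus `references` and the function names `coupling_strength` and `cocitation_counts` are hypothetical and do not come from SciPlore or any other system mentioned in the paper.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy corpus: citing document -> set of documents it cites.
references = {
    "A": {"1", "2", "3"},
    "B": {"1", "2", "3"},
    "C": {"3", "4"},
}


def coupling_strength(refs, doc_x, doc_y):
    """Bibliographic coupling: number of references two documents share."""
    return len(refs[doc_x] & refs[doc_y])


def cocitation_counts(refs):
    """Co-citation: for each pair of cited documents, count the citing documents
    whose reference list contains both."""
    counts = Counter()
    for cited in refs.values():
        for pair in combinations(sorted(cited), 2):
            counts[pair] += 1
    return counts


cocited = cocitation_counts(references)
```

Both measures look only at the reference lists, which is why neither can tell where in a document two sources are discussed together.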
Footnote 1: http://citeseer.ist.psu.edu

3 Research Questions

I want to answer the following three research questions in order to improve Academic Paper Recommender Systems.

- What are the strengths and weaknesses of the currently used approaches to measuring semantic relatedness (whether citation-, text- or user-behavior-based)?
- Is there a better way to automatically measure semantic relatedness?
- How do these new approaches perform in comparison to the currently used approaches?

4 Methodology

The methodology follows six steps. Currently, the empirical study is in progress.

1. Literature review and evaluation of existing approaches
   - Text mining (bag of words, etc.)
   - Citation analysis (bibliographic coupling, co-citation analysis)
   - Community based approaches (tagging, annotating etc.)
   - Further aspects like ranking algorithms, collaborative document evaluation, mind maps, etc.
2. Development of two new approaches to alleviate the shortcomings of existing approaches
   - Citation / Quotation Proximity Analysis (CPA)
   - Citation based Plagiarism Detection (CbPD)
3. Implementation of existing and new approaches in a prototype (see www.sciplore.org)
4. Empirical comparison and analysis of suitability (qualitative and quantitative)
   - Quality of results
   - Performance
5. Extension and optimization of new approaches
   - Combination with existing approaches
   - Adjustment of parameters
6. Development of a procedure model that considers the document type
   - Scientific publications containing citations and a clear structure such as abstract, related work, findings etc.
   - Websites, patent applications, technical documents, etc.

5 First Results

Two new approaches called Citation Proximity Analysis (CPA) and Citation based Plagiarism Detection (CbPD) have been developed. CPA is a variant of co-citation analysis that additionally considers the proximity of citations to each other within an article's full text. The underlying idea is that the closer citations are to each other in a document, the more likely it is that the cited documents are related. For example, citations listed in the same sentence are more likely to express related thoughts than citations listed only in the same section. In CbPD, the pattern, order, co-occurrence etc. of citations are analyzed. This allows the identification of a text that has been translated from language A to language B, as the citations remain in a similar or even identical order.
Citation Proximity Analysis (CPA)

Instead of just using the bibliography, CPA uses the proximity of the citations to each other in the full text to calculate the Citation Proximity Index (CPI) in three steps.

1. The document is parsed and a series of heuristics are used to process the citations, including their position within the document (see footnote 2).
2. The citations are assigned to their corresponding items in the bibliography. The overall margin of error with the system we have developed equals nearly three percent for the first and second step.
3. The proximity among each citation pair is examined. The underlying assumption is that the closer the citations are to each other, the more likely it is that they are related.

Based on this proximity analysis, the CPI is calculated. If, for example, two citations are given in the same sentence, the probability that they are related is higher (CPI = 1) than if they are cited only within the same paragraph (CPI = ¼). See Figure 4.

Figure 4: Illustration CPA (a citing document and Documents 1 to 3, with citation pairs labeled CPI = ¼ and CPI = 1)

Footnote 2: The citations were parsed using a modified version of parscit (http://wing.comp.nus.edu.sg/parscit) in combination with the author's self-developed software, which is available upon request.

However, further research needs to be performed to identify the appropriate weighting of the CPI values according to their occurrence, which also seems to depend on the publication's research field or type. For instance, different weightings seem more suitable for analyzing a technical report or patent specification than for a research article.

The results delivered by CPA can be improved by evaluating as many sources as possible. Additional evidence comes from multiple occurrences of the same citation and from multiple documents citing a certain document. In our series of tests, we obtained the best results by calculating the weighted average of the CPIs. By automating the process described above, we have calculated the CPI for publications contained in the SciPlore database. The results show that, in comparison to co-citation analysis, CPA delivers considerably better results in identifying related documents [4].

The same principle can be applied to links on websites or to quotations instead of citations (Quotation Proximity Analysis). If passages of two documents are quoted by a third document, the quoted documents are likely to be related. The closer the quotations are within the text of the quoting document, the higher the assumed relatedness, as illustrated in Figure 5.

Figure 5: Quotation Proximity Analysis (a "Review of Fantasy Books" quoting Lord of the Rings I, Lord of the Rings II and Harry Potter; distance = one sentence -> highly related; distance = one paragraph -> less related)

The "Review of Fantasy Books" quotes passages from two different volumes of Lord of the Rings and from Harry Potter. Only one sentence occurs between the quotes from the two Lord of the Rings volumes. Therefore, a relatively high relatedness of these two quotes and the quoted books can be assumed. In contrast, the distance between the quote from Harry Potter and the quotes from Lord of the Rings is larger. Therefore, the relatedness of these quotes and the quoted books can be assumed to be lower, but still higher than if they did not appear in the same document at all. A modification of the approach also allows classifying unknown documents based on the quotes they contain. In the example, the Review of Fantasy Books could be classified automatically if at least one of the quoted books has already been classified. This is especially useful for documents that contain no references or quotes themselves, such as novels.
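The third CPA step can be sketched as follows. This is a minimal illustration, not the SciPlore implementation: only the two CPI values named in the text (same sentence = 1, same paragraph = ¼) are used, every larger distance is assumed to score 0, and an unweighted mean stands in for the weighted average, whose weights the paper does not specify. All function and variable names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# CPI values stated in the text; larger distances score 0.0 here (assumption).
CPI_SAME_SENTENCE = 1.0
CPI_SAME_PARAGRAPH = 0.25


def pair_cpi(pos_a, pos_b):
    """CPI for two citation occurrences given as (paragraph, sentence) indices."""
    if pos_a == pos_b:
        return CPI_SAME_SENTENCE
    if pos_a[0] == pos_b[0]:
        return CPI_SAME_PARAGRAPH
    return 0.0


def document_cpis(citations):
    """citations: list of (cited_id, paragraph, sentence) from one citing document.

    Returns {(id_a, id_b): [cpi, ...]} with one value per pair of occurrences.
    """
    values = defaultdict(list)
    for (id_a, par_a, sen_a), (id_b, par_b, sen_b) in combinations(citations, 2):
        if id_a == id_b:
            continue  # a cited document is not paired with itself
        key = tuple(sorted((id_a, id_b)))
        values[key].append(pair_cpi((par_a, sen_a), (par_b, sen_b)))
    return values


def mean_cpi(values):
    """Unweighted mean per pair (stand-in for the paper's weighted average)."""
    return {pair: sum(v) / len(v) for pair, v in values.items()}


# Hypothetical citing document: [1] and [2] share a sentence, [3] appears later
# in the same paragraph, and [2] is cited again in the next paragraph.
occurrences = [("1", 0, 0), ("2", 0, 0), ("3", 0, 2), ("2", 1, 0)]
scores = mean_cpi(document_cpis(occurrences))
```

The same functions would apply to Quotation Proximity Analysis if quotation positions were passed in place of citation positions.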

Citation based Plagiarism Detection

Similar to the idea of CPA is another approach, which I call Citation based Plagiarism Detection (CbPD) or Citation Order Analysis [5]. Hundreds of papers have been published covering sophisticated approaches to detect plagiarism, and dozens of applications have been developed. All of them use more or less sophisticated approaches to analyze the text, but ignore the citations used [7, 12]. These approaches deliver good results in detecting copied text passages, but fail if text has been paraphrased or translated, as shown in Figure 6.

Figure 6: Comparison of Plagiarism Detection Systems (plagiarism forms ordered by degree of obfuscation / difficulty of detection: C&P Plagiarism, Disguised Plagiarism, Paraphrase, Translation, Idea Plagiarism; approaches shown: Substring Matching, CLPD, Fingerprinting, Citation based Plagiarism Detection, Bag of Words Analysis, Stylometry; each rated with a high, medium or low detection rate)

In contrast to CPA, CbPD mainly uses factors such as citation order and pattern analysis. The main advantage in comparison to the usually applied text-analysis approaches is that documents can still be identified as similar even if they were translated or paraphrased. Figure 7 and Figure 8 illustrate the concept.

Figure 7: Citation Pattern Analysis (flowchart with the elements: Start; Parse document to identify references; "Contains references?"; Match references with database; Apply text based plagiarism detection methods; "Bib. coupling strength 2?"; Citation Pattern Analysis; "Significant similarities detected?"; outcomes: No evidence of plagiarism / Potentially plagiarized)
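The decision flow of Figure 7 can be sketched as follows, under several assumptions that are not taken from the paper: the longest common subsequence of the two citation sequences stands in for the citation pattern analysis, the coupling check is read as "at least 2 shared sources", the fallback for a failed coupling check is "no evidence", and the thresholds `min_coupling` and `min_similarity` are hypothetical parameters.

```python
def lcs_length(seq_a, seq_b):
    """Length of the longest common subsequence of two citation sequences."""
    prev = [0] * (len(seq_b) + 1)
    for a in seq_a:
        curr = [0]
        for j, b in enumerate(seq_b, start=1):
            curr.append(prev[j - 1] + 1 if a == b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def order_similarity(seq_a, seq_b):
    """Share of the shorter citation sequence that reappears in the same order."""
    shorter = min(len(seq_a), len(seq_b))
    return lcs_length(seq_a, seq_b) / shorter if shorter else 0.0


def check_pair(seq_a, seq_b, min_coupling=2, min_similarity=0.8):
    """Walk one document pair through a Figure 7 style decision flow.

    min_coupling and min_similarity are hypothetical thresholds (assumptions).
    """
    if not seq_a or not seq_b:
        return "apply text-based methods"      # no references to analyze
    if len(set(seq_a) & set(seq_b)) < min_coupling:
        return "no evidence of plagiarism"     # too few shared sources
    if order_similarity(seq_a, seq_b) >= min_similarity:
        return "potentially plagiarized"
    return "no evidence of plagiarism"


# Hypothetical citation sequences: source identifiers in order of appearance.
original = ["s1", "s2", "s3", "s2", "s4"]
translated = ["s1", "s2", "s3", "s4"]   # one citation dropped, order preserved
unrelated = ["s4", "s9", "s1", "s8"]    # two shared sources, different order
```

Because the comparison uses only source identifiers and their order, it is unaffected by the language or wording of the surrounding text.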

Figure 8: Illustration of Citation Order Analysis (citations [1], [2] and [3] matched between Document A and Document B)

A comparison with the existing approaches is problematic, as text-based and citation-based approaches each have their own strengths. Whereas text-based approaches detect local similarity, like copied sentences, this citation-based approach analyzes global similarity. The interpretation of, for instance, a precision and recall value only makes sense when compared to other approaches. Since no other approaches exist for paraphrased and translated scientific text, such a comparison is not feasible. Test sets like PAN-PC-10, which was used at the Competition on Plagiarism Detection in 2010, are tailored to compare the performance of classical plagiarism detection systems, but are unsuitable for testing this new approach because citations were ignored.
To evaluate our approach, we ran a test on 0.8 million scientific publications from open access repositories and hid among them 20 specially designed plagiarized documents. To create a more realistic test scenario, we deleted some citations, added new ones, changed the order slightly, and changed the citation style. The outlined approach identified 19 of the 20 test documents, along with hundreds of other documents that contained at least some plagiarized sections. One very short document was not identified; it cited five sources, of which we deleted two. Figure 9 shows how CbPD compares to the currently used text-based detection approaches. It also indicates that the performance is best if the text-based and the citation-based approach are combined.
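The ratings of Figure 9 can be checked with a few lines of Python. The sketch below assumes that the six values in each row of the figure follow the column order of its header; the per-method mean is only a rough summary of ordinal ratings and is not a measure used in the paper. The names `METHODS`, `RATINGS`, `mean_rating` and `combined_never_worse` are hypothetical.

```python
METHODS = [
    "Fingerprinting", "Vector Space Retrieval", "Substring Matching",
    "Intrinsic", "Citation Pattern Based", "Combined (Text & Citation)",
]

# Ratings from Figure 9: 1 = above average, 2 = average, 3 = below average.
# Column order is assumed to match METHODS.
RATINGS = {
    "Copy&Paste (c&p)":   [1, 1, 1, 2, 1, 1],
    "Shake&Paste (s&p)":  [1, 1, 1, 2, 2, 1],
    "Expansive":          [2, 2, 3, 3, 2, 1],
    "Contractive":        [1, 1, 2, 3, 2, 1],
    "Mosaic":             [2, 2, 2, 3, 3, 2],
    "Technical disguise": [3, 3, 3, 3, 1, 1],
    "Undue paraphrase":   [3, 3, 3, 3, 1, 1],
    "Translated":         [3, 3, 3, 3, 1, 1],
    "Idea plagiarism":    [3, 3, 3, 3, 2, 2],
    "Self-plagiarism":    [1, 1, 1, 3, 1, 1],
}


def mean_rating(method):
    """Average rating of one method over all plagiarism forms (lower is better)."""
    idx = METHODS.index(method)
    return sum(row[idx] for row in RATINGS.values()) / len(RATINGS)


def combined_never_worse():
    """True if the combined approach is rated at least as well as every
    single method for every plagiarism form."""
    return all(row[-1] <= min(row[:-1]) for row in RATINGS.values())
```

Under this reading of the figure, the combined approach is never rated worse than any single method, which matches the statement that combining both approaches performs best.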

Ratings: 1 = above average, 2 = average, 3 = below average. The first four methods are existing approaches; the last two are new.

Plagiarism form       Fingerprinting  Vector Space Retrieval  Substring Matching  Intrinsic  Citation Pattern Based  Combined (Text & Citation)
Copy&Paste (c&p)      1               1                       1                   2          1                       1
Shake&Paste (s&p)     1               1                       1                   2          2                       1
Expansive             2               2                       3                   3          2                       1
Contractive           1               1                       2                   3          2                       1
Mosaic                2               2                       2                   3          3                       2
Technical disguise    3               3                       3                   3          1                       1
Undue paraphrase      3               3                       3                   3          1                       1
Translated            3               3                       3                   3          1                       1
Idea plagiarism       3               3                       3                   3          2                       2
Self-plagiarism       1               1                       1                   3          1                       1

Figure 9: Comparison of Detection Quality

By lowering the threshold, not only can plagiarism be detected, but also documents that were involved in the creation process without being cited. Figure 10 shows an example. Document A was probably read by the author of Document B, but Doc A was not cited. This is not usually considered plagiarism, but knowledge concerning which papers were involved in the creation process can be of interest.

Figure 10: Identification of non-cited documents (timeline from 2000 to 2010 showing the citations of Doc A and of pages 1 to 3 of Doc B)

6 Conclusion

This paper gave an overview of my PhD research project, which addresses the difficulty of measuring document relatedness in order to, for example, improve Academic Recommendation Systems and identify plagiarism. Two approaches were presented and their advantages and disadvantages discussed. For more in-depth information please consult my publications.

References

[1] Joeran Beel and Bela Gipp. The Potential of Collaborative Document Evaluation for Science. In George Buchanan, Masood Masoodian, and Sally Jo Cunningham, editors, 11th International Conference on Asia-Pacific Digital Libraries (ICADL'08) Proceedings, volume 5362 of Lecture Notes in Computer Science (LNCS), pages 375-378, Heidelberg (Germany), December 2008. Springer. Also available on http://www.sciplore.org.
[2] RM Fano. Information theory and the retrieval of recorded information. In Documentation in Action: Based on 1956 Conference on Documentation at Western Reserve University, page 238. Reinhold Publishing Corp., 1956.
[3] E. Garfield. From bibliographic coupling to co-citation analysis via algorithmic historio-bibliography. volume 27, 2001.
[4] Bela Gipp and Joeran Beel. Citation Proximity Analysis (CPA) - A new approach for identifying related work based on Co-Citation Analysis. In Birger Larsen and Jacqueline Leta, editors, Proceedings of the 12th International Conference on Scientometrics and Informetrics (ISSI'09), volume 2, pages 571-575, Rio de Janeiro (Brazil), July 2009. International Society for Scientometrics and Informetrics. ISSN 2175-1935. Also available on http://www.sciplore.org.
[5] Bela Gipp and Joeran Beel. Citation Based Plagiarism Detection - A New Approach to Identify Plagiarized Work Language Independently. In Proceedings of the 21st ACM Conference on Hypertext and Hypermedia (HT'10), pages 273-274, New York, NY, USA, June 2010. ACM.
[6] R. Klavans and K.W. Boyack. Identifying a better measure of relatedness for mapping science. Journal of the American Society for Information Science and Technology, 57(2):251-263, 2006.
[7] R. Lukashenko, V. Graudina, and J. Grundspenkis. Computer-based plagiarism detection methods and tools: An overview. In Proceedings of the 2007 international conference on Computer systems and technologies, page 40. ACM, 2007.
[8] IV Marshakova. System of document connections based on references. Scientific and Technical Information Serial of VINITI, 6(2):3-8, 1973.
[9] P. Resnik et al. Using information content to evaluate semantic similarity in a taxonomy. In International Joint Conference on Artificial Intelligence, volume 14, pages 448-453, 1995.
[10] A. Rip and J.P. Courtial. Co-word maps of biotechnology: An example of cognitive scientometrics. Scientometrics, 6(6):381-400, 1984.
[11] H Small. Co-citation in the scientific literature: a new measure of the relationship between two documents. Journal of the American Society for Information Science, 24:265-269, 1973.
[12] Benno Stein, Paolo Rosso, Efstathios Stamatatos, Moshe Koppel, and Agirre Eneko, editors. Proceedings of the 3rd PAN Workshop. Uncovering Plagiarism, Authorship and Social Software Misuse, 2009.
[13] C. Sternitzke and I. Bergmann. Similarity measures for document mapping: a comparative study on the level of an individual scientist. Scientometrics, 78(1):113-130, 2009.