Supplementary Note. Nature Biotechnology: doi: /nbt.

Supplementary Note

Of the 100 million patent documents residing in The Lens, 7.6 million contain non-patent literature (NPL) citations as strings of free text. These strings were obtained from the DOCDB master documentation database. As noted above, however, many patent documents pertain to the same underlying invention, both because within a jurisdiction there are multiple documents for a given application (e.g. the application, search reports, and the issued patent) and because the same invention can be the basis for applications in multiple jurisdictions. A patent family is a set of documents, often spanning multiple jurisdictions, that all pertain to the same invention. As shown in Supplementary Table 1, the 100 million patent documents represent almost 54.87 million families; of these, 21.15 million are families in which at least one patent has been granted as of 1790 (ref. 42).1 For metrics purposes it may be important to distinguish inventions with a granted patent from those without, but for our mapping purpose we include citations from all families, because a citation indicates technological activity by a given party in a given area, whether or not that activity has resulted in a patent grant. More than 4.7 million of the 55 million families contain at least one NPL citation. Multiple documents in the same family citing the same scholarly article do not meaningfully represent multiple citations; we therefore eliminated duplicate citations within a family from our analysis. On this basis, there are 31.6 million total NPL citation strings, or about 6.7 strings per family (conditional on the family having at least one). Supplementary Table 1 shows further statistics on the coverage of total NPL citation strings in The Lens (both resolved to unique identifiers and unresolved) as of January 25, 2017. Patent counts are based on simple patent families rather than individual document counts.
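For illustration, the within-family deduplication described above can be sketched in Python as follows; the data layout (a family identifier paired with a raw citation string) is hypothetical and does not reflect The Lens's actual schema:

```python
from collections import defaultdict

def dedupe_family_citations(citations):
    """Collapse duplicate NPL citation strings within each patent family.

    `citations` is an iterable of (family_id, citation_string) pairs;
    several documents in the same family citing the same article count once.
    """
    per_family = defaultdict(set)
    for family_id, citation in citations:
        per_family[family_id].add(citation)
    return per_family

# Hypothetical toy data: two documents of family "F1" cite the same article.
raw = [("F1", "Smith 2001, Nature 410:120"),
       ("F1", "Smith 2001, Nature 410:120"),
       ("F1", "Jones 1999, Science 286:5"),
       ("F2", "Smith 2001, Nature 410:120")]
deduped = dedupe_family_citations(raw)
total = sum(len(strings) for strings in deduped.values())
avg = total / len(deduped)  # strings per family with at least one citation
# At corpus scale: 31,629,031 deduplicated strings / 4,701,336 families ≈ 6.7
```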
Supplementary Table 1. Coverage of total NPL citation strings (resolved and unresolved with unique identifiers) in The Lens.

Measure | Coverage in all patent families | Coverage in patent families with a granted patent
Total number of patent families in The Lens | 54,872,483 | 21,156,119
Number of patent families with an NPL citation string (whether resolved to an identifier or not) | 4,701,336 | 3,425,91
Total number of NPL citations | 35,964,774 | Not determined
  US NPL citations | 24,177,520 |
  WIPO NPL citations | 4,510,576 |
  EPO NPL citations | 4,207,771 |
  China NPL citations | 1,334,114 |
Number of total NPL citations after duplicates within a patent family are eliminated | 31,629,031 | 27,499,726
Average number of NPL citation strings per patent family (all) | 6.7* | 8.0
Maximum number of NPL citation strings in a patent family | 3,238 | 3,238

Resolved NPL citation strings with unique identifiers:
Number of citations resolved to a DOI* | 8,302,553 | 7,184,318
Average number of DOI citations per family | 1.8* | 2.1
Maximum number of DOI citations per family | 608 | 608
Number of citations resolved to a PMID* | 5,864,716 | 4,900,595
Average number of PMID citations per family | 1.3* | 1.4
Maximum number of PMID citations in a family | 773 | 773

* A wide variation in the number of citations per patent family was observed. The estimated overlap between DOI and PMID citations is 4.3 million.

1 https://www.lens.org/lens/search?s=pub_date&d=%2b&q=&dates=%2bpub_date%3a18000101-20170228&types=granted+patent

Supplementary Methods

PubMed is a highly enriched, standardized bibliographic dataset covering the majority of the scholarly literature in the life and health sciences; each article is assigned an unambiguous persistent identifier, the PMID. Current holdings exceed 25 million records, and most include rich value-added data such as Medical Subject Headings (MeSH), textual data such as abstracts, links to primary articles, and institutional affiliations. To support enhanced NPL citation resolution, NCBI developed a new indexing engine called Hydra. The goal of Hydra is to provide accurate resolution of query text to existing citations while minimizing the rate of false-positive matches.2 The focus of Hydra is on text available from abstracts, titles, author lists, and journal imprints (see the next section below).

Crossref is a non-profit industry association that, in collaboration with its thousands of publisher members, enables the assignment and resolution of unique, persistent and unambiguous Digital Object Identifiers (DOIs) for scholarly works. Members of Crossref include both for-profit and non-profit, open and proprietary publishers, and their holdings currently exceed 80 million records. Assignment of DOIs extends into all research disciplines, including the life, physical, chemical, mathematical, economic and social sciences; it also covers books, monographs, conference proceedings and other works that are often important as prior art.3 Resolution of NPL citations was done through the Crossref API.4

Description of the Hydra algorithm for citation matching

Hydra considers all abstracts from the publicly accessible medical and biological literature as provided through PubMed. From the text, all single words and two- and three-word phrases are extracted and included in the final dictionary.
2 Navarro, G. A guided tour to approximate string matching. ACM Computing Surveys 33, 31-88, doi:10.1145/375360.375365 (2001).
3 Paskin, N. Digital object identifiers for scientific data. Data Science Journal 4, 1-9 (2005); accessible at http://www.doi.org/topics/050210codataarticledsj.pdf
4 http://search.crossref.org/help/api (see the /links endpoint)

For each term, the initial field is
remembered for purposes of ranking. Stop words are in general included; however, all two- or three-word phrases beginning with a stop word are dropped. We apply a limited term-synonymy extraction, designed primarily to account for the multiplicity of journal-name representations. For example, The New England Journal of Medicine may be cited as New Engl J Med, or possibly NEJM. Author names are expanded to include variants of representation that use initials, full names, or last name only; dates are expanded to multiple possible forms, reflecting different representations in journal formatting. The full list of terms per document is extracted into a series of postings files for indexed look-up. There are two post-indexing stages in the Hydra system that improve search performance in both execution time and fidelity. First, we extract multiple Bloom filters, used to exclude terms that are not present in the database. Second, we establish a set of per-field weights that are applied at run time to rank results according to best fit for citations. The per-field weights are determined using a tree-based boosting algorithm. We use a manually reviewed set of 5,000 user-supplied PubMed queries as the basis for training, including both positive and negative citation searches. Boosting requires the application of multiple independent algorithms to determine approximations of the final field weights. Hydra currently employs two algorithms: a fuzzy-logic classifier employing logistic regression, and a ranking algorithm that uses field weights to establish the relevance of articles. Field weights are initially biased toward English-language abstracts and reviews from high-impact journals. Searching within the Hydra system evaluates all incoming terms for presence within the postings file, using Bloom filters to exclude absent terms.
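The Bloom-filter pre-check can be illustrated with a minimal sketch. This is a generic Bloom filter, not NCBI's Hydra implementation; the filter size and number of hash functions are arbitrary assumptions:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, term):
        # Derive `num_hashes` bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{term}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, term):
        for p in self._positions(term):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, term):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(term))

# Index time: add every dictionary term. Query time: skip terms the
# filter rejects, since they cannot be present in the postings files.
bf = BloomFilter()
for term in ["new engl j med", "citation", "boosting"]:
    bf.add(term)
candidates = [t for t in ["citation", "zzzq"] if t in bf]
```

A term rejected by the filter is guaranteed absent from the index, so it can be discarded without touching the postings files; occasional false positives only cost a wasted look-up.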
All terms are permuted to induce a single-letter change (insertion, deletion, or substitution), generating a set of candidate terms with a Levenshtein edit distance of 1 from the initial set. This expansion accounts for common artifacts and misspellings. Terms obtained through edit-distance expansion are given a lower weight than the initial terms. Search then proceeds using an estimator function that eliminates terms that are unlikely to affect the final ranking. Results are evaluated using a merge-sort ranking algorithm that evaluates postings vectors in increasing order of size (shortest vectors first) and proceeds until at least three vectors or 10,000,000 documents have been evaluated. As a result, high-frequency single-word terms may not be evaluated at all, in preference to evaluating the two- and three-word phrases that contain them. Final scores are modified based on training weights, applied to fielded terms at merge-sort time.
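The edit-distance-1 expansion can be sketched as follows; the lowercase ASCII alphabet is an assumed simplification, as Hydra's actual character set is not specified here:

```python
import string

def edit_distance_1(term, alphabet=string.ascii_lowercase):
    """All strings one Levenshtein operation away from `term`:
    a single insertion, deletion, or substitution."""
    splits = [(term[:i], term[i:]) for i in range(len(term) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    substitutes = {a + c + b[1:] for a, b in splits if b
                   for c in alphabet if c != b[0]}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    # The original term itself is excluded from the candidate set.
    return (deletes | substitutes | inserts) - {term}

variants = edit_distance_1("nejm")
```

Each variant would then be looked up like an ordinary dictionary term but contribute a lower weight to the final score, as described above.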

To support final citation matching, we further require that there be exactly one unambiguous match. The highest-ranking item is then returned; in cases of score ties, all tied items are returned, but such a result is not considered specific enough to qualify as a matched citation.

Additional readings
1. Friedman, J. H. Greedy function approximation: a gradient boosting machine. IMS 1999 Reitz Lecture. http://www-stat.stanford.edu/~jhf/ftp/trebst.pdf
2. Mason, L., Baxter, J., Bartlett, P. L. & Frean, M. Boosting algorithms as gradient descent. In Advances in Neural Information Processing Systems 12 (eds Solla, S. A., Leen, T. K. & Müller, K.) 512-518 (MIT Press, 1999).
3. Bloom, B. H. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM 13(7), 422-426 (1970). doi:10.1145/362686.362692
4. Levenshtein, V. I. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady 10(8), 707-710 (1966).
5. Navarro, G. A guided tour to approximate string matching. ACM Computing Surveys 33(1), 31-88 (2001). doi:10.1145/375360.375365

Supplementary Table 2. Number of unique articles in the various datasets.

Dataset | Unique articles
Articles_Clarivate Analytics_total | 13,770,091
Resolved articles with identifiers | 11,748,697
Resolved & cited articles in the patent literature | 1,202,523
Average cited articles / total resolved articles | 0.10
Number of citing patents | 1,117,712
Number of citing patents, weighted by family | 4,659,652
Number of citing patent families | 689,097

Average citing patents / article (using resolved articles) | 0.4

Supplementary Table 3. Top 50 research institutions based on normalized aggregate citation counts* [see Excel file].

* For the normalized aggregate citation measure, we first weighted citations in each of the 10 research disciplines and then summed the normalized citations across all disciplines. To calculate the normalized In4M metric, we divided the normalized aggregate citation counts by the number of resolved articles. To view the rankings of all 200 institutions, go to https://www.lens.org/lens/in4m#/rankings/global/locations
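The two-step normalization described above can be sketched as follows; the per-discipline baselines and the institution's counts are hypothetical, since the exact discipline weights are not specified here:

```python
def normalized_in4m(per_discipline_citations, discipline_baselines,
                    resolved_articles):
    """Sketch of the normalization: weight (normalize) citations within
    each discipline, sum across disciplines, then divide by the number
    of resolved articles for the institution.

    discipline_baselines is an assumed normalization denominator per
    discipline (e.g. a field average); the authors' actual weighting
    scheme is not reproduced here.
    """
    aggregate = sum(count / discipline_baselines[d]
                    for d, count in per_discipline_citations.items())
    return aggregate / resolved_articles

# Hypothetical institution with citations in two of the ten disciplines.
score = normalized_in4m({"biology": 300, "chemistry": 50},
                        {"biology": 100.0, "chemistry": 25.0},
                        resolved_articles=10)
# 300/100 + 50/25 = 5.0, divided by 10 resolved articles -> 0.5
```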
