BIBLIOGRAPHIC DATA: A DIFFERENT ANALYSIS PERSPECTIVE. Francesca De Battisti *, Silvia Salini

Electronic Journal of Applied Statistical Analysis, EJASA (2012), Electron. J. App. Stat. Anal., Vol. 5, Issue 3, 353-359
e-ISSN 2070-5948, DOI 10.1285/i20705948v5n3p353
2012 Università del Salento, http://siba-ese.unile.it/index.php/ejasa/index

Francesca De Battisti *, Silvia Salini
Department of Economics, Management and Quantitative Methods, University of Milan, Italy

Received 19 July 2012; Accepted 07 October 2012; Available online 16 November 2012

Abstract: A bibliographic record, related to a product, consists of several pieces of information: authors, year, source, publisher, keywords, abstract, citations and so on. Citations usually have a central role in bibliometric analysis. The study of textual information offers a different analysis perspective. The idea is that documents are mixtures of latent topics, where a topic is a probability distribution over words. In this paper we try to show how the scientific productivity of a research group can be described using topic models. Moreover, for the same sample, we test whether the other bibliometric measures follow the known distribution laws.

Keywords: Text mining, topic models, bibliometrics, distribution laws

1. Introduction

A bibliometric database contains a large amount of heterogeneous information, which makes different types of analysis possible [8, 7]. The purpose of this study is to present an overview of them, focusing on the analysis of textual information in order to extract the latent topics that characterize the papers. Bibliographic data are complex: different types of information and objects are involved, namely measures (counts, indices), networks (co-citations, co-authorships) and textual data (title, keywords, abstract, full text).
Bibliometrics, defined by the Oxford English Dictionary as "the branch of library science concerned with the application of mathematical and statistical analysis to bibliography; the statistical analysis of books, articles, or other publications", can be used with two main aims: the evaluation of research and the measurement of science. In this paper we focus on the second. The Web of Science database, edited by the Institute for Scientific Information and distributed by Thomson Reuters (http://isiwebofknowledge.com/), is used for this exercise. The database is queried for the scientific output of all researchers in Statistics, SECS/S01 (444 subjects). We analyse 302 authors and 1309 products. In Section 2, topic models are presented and applied. In Section 3, bibliometric laws are briefly described and tested in order to verify whether they are satisfied by our data. Finally, future developments and conclusions are proposed.

* E-mail: francesca.debattisti@unimi.it

2. Topic Models

Topic models are based on the idea that documents are mixtures of topics, where a topic is a probability distribution over words. The documents are observed, while the topics (and their distributions) are treated as hidden structures or latent variables. Topic modelling algorithms are statistical methods that analyse the words of the original texts to discover the themes that run through them, how these themes are connected to each other, and how they change over time [1].

The simplest and most commonly used probabilistic approach to document modelling is the generative model Latent Dirichlet Allocation (LDA) [4]. The idea behind LDA is that documents blend multiple topics. A topic is defined as a distribution over a fixed vocabulary. For example, a statistics topic assigns high probability to words about statistics. The model assumes that the topics are generated before the documents. For each document, the words are generated in a two-stage process: i) randomly choose a distribution over topics (a draw from a Dirichlet distribution); ii) for each word, first randomly choose a topic from the distribution over topics and then randomly choose a word from the corresponding distribution over the vocabulary.

The central problem of topic modelling is to use the observed documents to infer the latent variables. Topic models are probabilistic models in which data are treated as arising from a generative process that includes hidden (or latent) variables. This process defines a joint probability distribution over both the observed and hidden random variables.
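The two-stage generative process described above can be sketched in a few lines of Python. This is only an illustrative simulation under stated assumptions, not the procedure used in the paper: the function names, the toy vocabulary, the fixed document length and the symmetric Dirichlet parameters `alpha` and `eta` are all hypothetical choices made for the example.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw an index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(vocab, n_topics, n_docs, doc_length,
                    alpha=0.5, eta=0.5, seed=0):
    """Simulate documents from the LDA generative process (toy settings)."""
    rng = random.Random(seed)
    # Topics are generated before the documents:
    # each topic is a distribution over the fixed vocabulary.
    topics = [sample_dirichlet([eta] * len(vocab), rng)
              for _ in range(n_topics)]
    corpus = []
    for _ in range(n_docs):
        # Stage i: choose the document's distribution over topics.
        theta = sample_dirichlet([alpha] * n_topics, rng)
        words = []
        for _ in range(doc_length):
            # Stage ii: choose a topic, then a word from that topic.
            z = sample_categorical(theta, rng)
            w = sample_categorical(topics[z], rng)
            words.append(vocab[w])
        corpus.append(words)
    return topics, corpus


# Hypothetical toy vocabulary, for illustration only.
VOCAB = ["bayes", "prior", "estimator", "simulation", "fuzzy", "clustering"]
topics, corpus = generate_corpus(VOCAB, n_topics=2, n_docs=3, doc_length=5)
```

Inference runs in the opposite direction: only the words in `corpus` are observed, and `topics` and the per-document topic proportions have to be recovered from them.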
The conditional distribution of the hidden variables given the observed variables, also called the posterior distribution, is then computed. The numerator of the conditional distribution is the joint distribution of all the random variables, which can be easily computed; the denominator is the marginal probability of the observations, that is, the probability of seeing the observed corpus under any topic model. Theoretically, it can be computed by summing the joint distribution over every possible instantiation of the hidden topic structure; in practice, because the number of possible topic structures is exponentially large, this sum is intractable.

Topic modelling algorithms fall into two categories, which propose different alternative distributions to approximate the true posterior: sampling-based algorithms, such as Gibbs sampling, and variational algorithms. The first group constructs a Markov chain, a sequence of random variables each dependent on the previous one, whose limiting distribution is the posterior [19]. The second group, for example variational expectation-maximization (VEM), is a deterministic alternative to sampling-based algorithms. Rather than approximating the posterior with samples, variational methods posit a parameterized family of distributions over the hidden structure and find the member of the family that is closest to the posterior; in this way, they transform the inference problem into an optimisation problem.

The correlated topic model (CTM), which explicitly models the correlation between the latent topics in the documents, was introduced in 2007 [3]. We fitted topic models using the R package topicmodels [10]. To choose the optimal number of topics, perplexity is calculated [4]. Perplexity, used by convention in language modelling, is monotonically decreasing in the likelihood of the test data; a lower perplexity score indicates better generalization performance.
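For reference, the two quantities discussed above can be written compactly. The notation is an assumption introduced here, since the text does not fix any symbols: $\beta$ denotes the topics, $\theta$ the per-document topic proportions, $z$ the per-word topic assignments, $w$ the observed words, and $N_d$ the number of words in test document $d$ out of $M$ test documents. The perplexity expression follows the definition given in [4].

```latex
% Posterior of the hidden topic structure given the observed words
\[
  p(\beta, \theta, z \mid w) \;=\; \frac{p(\beta, \theta, z, w)}{p(w)}
\]

% Perplexity of a held-out test set of M documents
\[
  \operatorname{perplexity}(D_{\text{test}})
  \;=\; \exp\!\left\{ -\,\frac{\sum_{d=1}^{M} \log p(w_d)}{\sum_{d=1}^{M} N_d} \right\}
\]
```

The denominator $p(w)$ in the first expression is the intractable marginal, and the negative sign in the exponent of the second is what makes perplexity decrease as the likelihood of the test data increases.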

Figure 1. Perplexity by number of topics for VEM and CTM.

We compared the perplexity of the VEM and CTM algorithms. Judging from Figure 1, the optimal number of topics appears to be n = 30, because beyond this value the perplexity curves level off. Once the topics are identified, papers can be clustered. It is useful to evaluate, for the chosen model, the probability with which each document is assigned to its most likely topic, and to count the number of papers corresponding to each topic when only the most relevant one is considered (see Figure 2).

Figure 2. CTM: most likely topic distribution (left) and topic relevance (right).

It is also important to examine the strength of each topic over time, which provides quantitative measures of the prevalence of particular kinds of research [9].

Figure 3. Topic trends according to their relevance by year.

Figure 3 shows only the topics for which the mean relevance differs significantly over the years: for example, topic 26 was popular in 2002. The words associated with topics 26, 24 and 2 are listed below:

Topic 26: bayes, factor, prior, design, priors, size, sample, evidence, fractional, trials
Topic 24: estimator, method, simulation, function, integration, estimation, tree, measurement, reliability, risk
Topic 2: fuzzy, component, principal, approach, clustering, dynamic, time, squares, interval, spatial

3. Bibliometric Laws

The laws of bibliometrics originated in the first half of the twentieth century to describe, monitor and model the production, use and dissemination of knowledge. In particular, Lotka's law [12] characterizes the frequency of publications by author in a given field; Bradford's law [6] is useful for librarians in determining the number of core journals in any field; finally, Zipf's law [20] is often used to predict the frequency of words within a text.

The Lotka distribution is based on an inverse square law, where the number of authors writing n papers is $1/n^2$ times the number of authors writing one paper. In order to test the applicability of Lotka's law to our data, for a given number of papers (NP), the number of authors (NA), the observed relative frequencies (Obs) and the expected ones (Exp) are reported in Table 1 and plotted in Figure 4. Moreover, a test based on the distance between the two cumulative quantities can be carried out [16].

Table 1. Observed and expected frequencies of authors for number of papers.

NP   NA   Obs    CumObs   Exp    CumExp   Dist
1    70   0.23   0.23     0.61   0.61     0.38
2    54   0.18   0.41     0.15   0.76     0.03
3    39   0.13   0.54     0.07   0.83     0.06
4    28   0.09   0.63     0.04   0.87     0.05
5    31   0.10   0.74     0.02   0.89     0.08
6    24   0.08   0.81     0.02   0.91     0.06
7    13   0.04   0.86     0.01   0.92     0.03
8    10   0.03   0.89     0.01   0.93     0.02
9    7    0.02   0.91     0.01   0.94     0.02
10   6    0.02   0.93     0.01   0.94     0.01

Figure 4. Observed and expected percentage of authors for number of papers.

The Kolmogorov-Smirnov (K-S) test is based on the maximum deviation $D = \max |\mathrm{CumExp} - \mathrm{CumObs}|$. At the 0.01 level of significance, the K-S critical value is 0.094. If D is greater than the critical value, the sample distribution does not fit the theoretical distribution. In our case D is 0.38, so Lotka's law does not apply to our data. Reviews of the literature [16], criticisms [15] and a re-evaluation of the law [13] have been published.
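The K-S check above can be reproduced from the NA column of Table 1. The sketch below rests on two assumptions that the text does not state explicitly: the expected share of authors with n papers is taken as $(6/\pi^2)/n^2$, which yields the 0.61 share of single-paper authors in the Exp column, and the 0.01 critical value is taken from the large-sample approximation $1.63/\sqrt{N}$ with N = 302 authors, which yields 0.094. The variable and function names are illustrative.

```python
import math

# NA column of Table 1: number of papers -> number of authors.
AUTHORS_BY_PAPERS = {1: 70, 2: 54, 3: 39, 4: 28, 5: 31,
                     6: 24, 7: 13, 8: 10, 9: 7, 10: 6}
TOTAL_AUTHORS = 302  # all authors in the sample, including those with > 10 papers


def lotka_expected(n):
    """Expected share of authors with n papers under the inverse square law.

    Assumption: the constant 6 / pi^2 normalises the shares over n = 1, 2, ...
    """
    return (6 / math.pi ** 2) / n ** 2


def ks_distance(counts, total):
    """Maximum absolute gap between cumulative expected and observed shares."""
    cum_obs = 0.0
    cum_exp = 0.0
    max_gap = 0.0
    for n in sorted(counts):
        cum_obs += counts[n] / total
        cum_exp += lotka_expected(n)
        max_gap = max(max_gap, abs(cum_exp - cum_obs))
    return max_gap


D = ks_distance(AUTHORS_BY_PAPERS, TOTAL_AUTHORS)
# Assumption: large-sample K-S critical value at the 0.01 level.
CRITICAL = 1.63 / math.sqrt(TOTAL_AUTHORS)

print(f"D = {D:.2f}, critical value = {CRITICAL:.3f}, "
      f"Lotka's law rejected: {D > CRITICAL}")
# prints: D = 0.38, critical value = 0.094, Lotka's law rejected: True
```

The largest gap occurs at n = 1: only 23% of the authors have a single paper, against the 61% that the inverse square law would predict.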

Figure 5. Distribution of the top 20 journals.

The Bradford distribution groups journals and articles to identify the number of periodicals relevant to a particular subject. A core of journals is thus identified, which could be used to select the essential journals for a special collection. Bradford's distribution was made more general by grouping journals according to the number of citations they receive [11]. Figure 5 shows the 20 journals in which Italian statisticians publish most frequently.

The citation distribution provides basic insight into the relative popularity of scientific publications. The number of citations received by scientific papers appears to have a power-law distribution [14, 17]. The distribution of citations is a rapidly decreasing function of the citation count, but it does not appear to be described by a single function over the entire range of this variable [18]. A Zipf plot is well suited for determining the large-x tail of the citation distribution.

Figure 6. H-index.

Figure 7. Lorenz Curve.

Figure 6 shows the distribution of papers ranked by decreasing number of citations. The intersection between the paper distribution and the diagonal is the H-index of the community of Italian statisticians. When the research group is the unit of analysis, some measures of concentration should be computed. In the Lorenz curve, the cumulative proportion of articles (x-axis) is plotted against the cumulative proportion of their total citations (y-axis). The Lorenz curve captures the degree of inequality or concentration. If each article had an equal share of the total citations, the curve would be a straight diagonal line (the perfect equality line); if the observed curve deviates from the perfect equality line, the articles do not contribute equally to the total number of citations [5].
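As a minimal sketch of the three quantities just described, the following Python functions compute an H-index, the points of a Lorenz curve and a Gini index from a list of per-paper citation counts. The citation counts are invented toy data, not the Web of Science sample, and the Gini formula used (the standard expression based on ranked values) is an assumption, since the text does not say which estimator was applied.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def lorenz_curve(citations):
    """Points (share of papers, share of citations), papers in ascending order.

    Assumes at least one paper has a positive citation count.
    """
    ordered = sorted(citations)
    total = sum(ordered)
    n = len(ordered)
    points = [(0.0, 0.0)]
    running = 0
    for i, cites in enumerate(ordered, start=1):
        running += cites
        points.append((i / n, running / total))
    return points


def gini(citations):
    """Gini index from ranked values: 0 is perfect equality.

    Assumes at least one paper has a positive citation count.
    """
    ordered = sorted(citations)
    n = len(ordered)
    total = sum(ordered)
    weighted = sum(i * cites for i, cites in enumerate(ordered, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical citation counts for ten papers, for illustration only.
toy_citations = [0, 0, 0, 0, 1, 2, 3, 5, 8, 40]
toy_h = h_index(toy_citations)
toy_gini = gini(toy_citations)
toy_lorenz = lorenz_curve(toy_citations)
```

With many uncited papers and a few highly cited ones, the Lorenz curve stays close to the x-axis for most of its range and the Gini index approaches 1.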
In our case, as confirmed by a Gini index of 0.956, there is a very high degree of concentration; indeed, 67% of the papers have received no citations (see Figure 7).

4. Conclusions and Future Perspectives

Concerning topic models, LDA and CTM assume that documents are exchangeable within the corpus; for many corpora this assumption is inappropriate, because the topics of a document collection evolve over time. The dynamic topic model (DTM) captures the evolution of topics in a sequentially organized corpus of documents [2]. In the future we will study the evolution of topics over time and the similarity between them. Furthermore, it will be interesting to evaluate, possibly by means of association rules or maps of science, whether there are significant associations among topics, journals, countries, author/citation networks and time. Concerning distribution laws, a simulation study will be carried out to identify the factors that could influence them: field or area, time period, type of publication and so on.

References

[1]. Blei, D.M. (2011). Introduction to Probabilistic Topic Models. Princeton University.
[2]. Blei, D.M., Lafferty, J.D. (2006). Dynamic topic models. Proceedings of the 23rd International Conference on Machine Learning, 113-120.
[3]. Blei, D.M., Lafferty, J.D. (2007). A correlated topic model of science. The Annals of Applied Statistics.
[4]. Blei, D.M., Ng, A.Y., Jordan, M.I. (2003). Latent Dirichlet allocation. The Journal of Machine Learning Research.
[5]. Bornmann, L., Mutz, R., Neuhaus, C., Daniel, H. (2008). Citation counts for research evaluation: standards of good practice for analyzing bibliometric data and presenting and interpreting results. Ethics in Science and Environmental Politics.
[6]. Bradford, S.C. (1934). Sources of information on specific subjects. Engineering, 137, 85-86.
[7]. De Battisti, F., Salini, S. (2012). Robust analysis of bibliometric data. Statistical Methods & Applications. In press. DOI 10.1007/s10260-012-0217-0.
[8]. Ferrara, A., Salini, S. (2012). Ten challenges in bibliographic data for bibliometrics analysis. Scientometrics, 93(3), 765-785.
[9]. Griffiths, T., Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences.
[10]. Grün, B., Hornik, K. (2011). topicmodels: An R Package for Fitting Topic Models. Journal of Statistical Software.
[11]. Hubert, J.J. (1977). Bibliometric Models for Journal Productivity. Social Indicators Research.
[12]. Lotka, A.J. (1926). The frequency distribution of scientific productivity. Journal of the Washington Academy of Sciences.
[13]. MacRoberts, M.H., MacRoberts, B.R. (1982). A Re-Evaluation of Lotka's Law of Scientific Productivity. Social Studies of Science.
[14]. Newman, M.E.J. (2006). Power laws, Pareto distributions and Zipf's law. arXiv:cond-mat/0412004v3.
[15]. O'Connor, D.O., Voos, H. (1981). Empirical Laws, Theory Construction and Bibliometrics. Library Trends.
[16]. Potter, W.G. (1981). Lotka's Law Revisited. Library Trends.
[17]. Price, D.J. De S. (1965). Networks of scientific papers. Science, 149, 510-515.
[18]. Redner, S. (1998). How popular is your paper? An empirical study of the citation distribution. The European Physical Journal B.
[19]. Steyvers, M., Griffiths, T. (2007). Probabilistic topic models. Handbook of Latent Semantic Analysis.
[20]. Zipf, G.K. (1949). Human Behaviour and the Principle of Least Effort. Addison-Wesley, Cambridge.

This paper is an open access article distributed under the terms and conditions of the Creative Commons Attribution - NonCommercial - NoDerivs 3.0 Italy License.