The use of bibliometrics in the Italian Research Evaluation exercises Marco Malgarini ANVUR MLE on Performance-based Research Funding Systems (PRFS) Horizon 2020 Policy Support Facility Rome, March 13, 2017
Background and motivation
A first evaluation exercise in Italy was carried out in 2006, with reference to the period 2001-2003; its results were used only to a limited extent for funding purposes
A new exercise was launched by the Italian Ministry of Education and Research in 2011, with reference to the period 2004-2010 (VQR 2004-2010); a new independent agency, ANVUR, was put in charge of managing it
Results of the first VQR were presented in July 2013
A new VQR, referring to the period 2011-2014, was launched by the Ministry in June 2015; the exercise involves 96 universities, 12 PROs and 27 other research bodies
The reporting unit is the individual researcher, who has to submit two publications for evaluation (three for PROs)
118,036 publications have been evaluated by 16 evaluation groups (GEVs) composed of 436 experts, who appointed 12,731 peer reviewers
ANVUR presented the evaluation results on February 21st, 2017
Results have been elaborated at the institutional and departmental level; no information is published at the individual level
The implicit goal of the evaluation is to increase the average quality of research activities in Italy
The main goals declared by ANVUR are:
- to provide the Ministry with the indicators needed to distribute up to 20% of total university financing
- to provide university management with relevant information for the governance of the university system
- to provide students, households and young researchers with relevant information to guide their personal choices
All performance indicators used in the evaluation are size-dependent: hence, the distribution of funding is not substantially altered by the exercise
Nevertheless, there is ample evidence that the introduction of the exercise has steered the behavior of the actors involved
Performance indicators used in the exercise are:
- quality of publications, as assessed by a system of informed peer review combining peer evaluation and bibliometric indicators (see below)
- quality of publications of new hires and people promoted in the period considered
- number of doctoral students and post-doc researchers
- external competitive funding
- increase/decrease of research quality with respect to the previous evaluation exercise
For funding purposes the Ministry mainly uses the indicators concerning research quality, the number of doctoral students and external competitive funding
Evaluation is organized into 16 areas: 01 Mathematics and Computer Sciences; 02 Physics; 03 Chemistry; 04 Earth Sciences; 05 Biology; 06 Medicine; 07 Agricultural and Veterinary Sciences; 08a Architecture; 08b Civil Engineering; 09 Industrial and Information Engineering; 10 Antiquities, Philology, Literary Studies, Art History; 11a History, Philosophy, Pedagogy; 11b Psychology; 12 Law Studies; 13 Economics and Statistics; 14 Political and Social Sciences
Data sources, indicators and overall design
Publications are evaluated against the following criteria:
a) originality: the degree to which the publication introduces a new way of thinking about the object of the research
b) methodological rigor: the degree to which the publication adopts an appropriate methodology and is able to present its results to peers
c) actual or potential impact: the current or potential level of influence that the research exerts on the relevant scientific community
Publications admitted for evaluation are:
- books
- articles and review essays
- book chapters
- other scientific publications, including compositions, designs, projects (architecture), performances, exhibitions, art objects, databases and software
- patents
Publications are classified into one of the following five evaluation classes (six if we also consider papers that cannot be evaluated for various reasons):
- Excellent (weight 1): publications ideally in the top 10% of the world distribution for originality and methodological rigor, with a strong impact on the scientific community in the area
- Good (weight 0.7): publications ideally in the 10-30% segment of the world distribution for originality and methodological rigor, with a relevant impact on the scientific community in the area
- Fair (weight 0.4): publications ideally in the 30-50% segment of the world distribution for originality and methodological rigor, with a fair impact on the scientific community in the area
- Acceptable (weight 0.1): publications ideally in the 50-80% segment of the world distribution for originality and methodological rigor, with an acceptable impact on the scientific community in the area
- Limited (weight 0): publications ideally in the bottom 80-100% segment of the world distribution for originality and methodological rigor, with a limited impact on the scientific community in the area
- Not valuable (weight 0): publications deemed impossible to evaluate for lack of documentation, or not eligible for evaluation (not a scientific product, or published outside of the evaluation period)
Evaluations are aggregated at the university/department level in order to assess the overall research quality of the institution
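The weighting scheme lends itself to a simple aggregation. A minimal sketch in Python (the function names and the input format are illustrative, not ANVUR's actual implementation):

```python
# Class weights as defined in the VQR call.
WEIGHTS = {
    "Excellent": 1.0,
    "Good": 0.7,
    "Fair": 0.4,
    "Acceptable": 0.1,
    "Limited": 0.0,
    "Not valuable": 0.0,
}

def institution_score(class_counts):
    """Size-dependent score: sum of class weights over all submissions."""
    return sum(WEIGHTS[cls] * n for cls, n in class_counts.items())

def average_score(class_counts):
    """Size-independent average quality per submitted publication."""
    total = sum(class_counts.values())
    return institution_score(class_counts) / total if total else 0.0
```

The first function is size-dependent, like the indicators actually used for funding; the second illustrates the size-corrected view used when comparing institutions of different dimensions.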
Evaluation is based on a system of informed peer review
In STEM areas and, to some extent, in Economics and Statistics, peer evaluation is integrated with bibliometric indicators concerning citations and journal impact, extracted from the ISI/Web of Science and Scopus databases
In HSS (with the partial exception of Economics and Statistics), evaluation is based purely on peer review
Overall, more than 50% of the publications submitted for evaluation were subjected to peer review
Bibliometric evaluation
Object of the bibliometric evaluation: articles published in journals indexed in either the Scopus or the ISI-Web of Science database
We use two different indicators:
- an indicator of journal impact: IF5Y and Article Influence Score in WoS; IPP and SJR in Scopus
- the number of citations as of February 29th, 2016
The choice of the database and of the impact indicator to be used in the evaluation was left to the researcher
On the use of journal impact indicators
The use of journal impact indicators in evaluation is often criticized in the literature (DORA declaration, Leiden Manifesto)
However, journal impact indicators are increasingly considered an acceptable proxy for article quality, provided they are used with the utmost care:
- Abramo et al. (2010) and Levitt and Thelwall (2011) argued that the impact factor can be a useful tool for the evaluation of recent articles
- In a recent analysis of REF2014 results, HEFCE (2015) shows that journal impact indicators correlate well with REF quality profiles based on peer review
- Waltman and Traag (2017) argue that journal-level indicators can represent the quality of articles more accurately than citations, if the quality of the articles published in a given journal is considered rather homogeneous
On the basis of the available evidence, we conclude that journal metrics should be handled with care, but cannot be excluded from massive research evaluation exercises
Possible recommendations for the use of journal metrics are:
- use more than one journal metric (IF, Eigenfactor, SJR)
- use journal metrics in combination with article-level metrics and evaluate their coherence
- always normalize with respect to the scientific field and year of publication
The bibliometric algorithm
The bibliometric evaluation of an article is determined by the combined use of the two indicators for citations and journal impact
We developed an algorithm applied separately to each subject category, year and type of publication (distinguishing among journal articles, letters and reviews)
For each article, the algorithm proceeds as follows:
- calculate the empirical cumulative distribution of the number of citations relative to all the world's articles in each subject category/year/type of publication
- calculate the empirical cumulative distribution of the journal impact indicator relative to all the world's journals in each subject category/year
- in this way, we identify two distinct percentiles for the number of citations and for the impact indicator of each article submitted for evaluation
- the two percentiles identify a point in the region of the Cartesian plane Q = [0,1]x[0,1], given by the journal-metric percentile (X axis) and by the citation percentile (Y axis)
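The percentile construction can be sketched as follows (a simplified illustration assuming the world distributions are available as plain lists; ties and the exact percentile convention used by the GEVs may differ):

```python
from bisect import bisect_right

def empirical_percentile(value, world_values):
    """Position of `value` in the empirical cumulative distribution of
    `world_values`: the fraction of world observations <= value."""
    ranked = sorted(world_values)
    return bisect_right(ranked, value) / len(ranked)

def article_point(citations, journal_metric, world_citations, world_metrics):
    """Map an article to a point in Q = [0,1] x [0,1]:
    x = journal-metric percentile, y = citation percentile."""
    return (empirical_percentile(journal_metric, world_metrics),
            empirical_percentile(citations, world_citations))
```

Both coordinates are computed within the article's own subject category, year and publication type, so that fields with different citation habits remain comparable.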
Q is then divided into five zones or regions matching the shares of articles defined in the VQR call (top 10%; 11-30%; 31-50%; 51-80%; 81-100%)
The partition is realized using simple straight lines defined by the linear equation CIT = A·IF + B_n
We assume that the angular coefficient A is the same for the three lines; its value is determined by the experts
The intercepts B_n are calculated by ANVUR, given A and the specific distribution of the subject category, so as to ensure that over the world distribution of articles the percentages indicated in the VQR call are always respected
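The partition and the calibration of the intercepts can be sketched as follows. This is an illustration, not ANVUR's code: the slope, the intercepts and the use of one boundary line per class cut-off are assumptions, and each B_n is obtained here as a quantile of the residual CIT − A·IF (a point lies above the line CIT = A·IF + B exactly when its residual exceeds B):

```python
def region(if_pct, cit_pct, A, intercepts):
    """Class index from 0 (lowest) to len(intercepts) (highest): the
    number of boundary lines CIT = A*IF + B_n the point lies above.
    `intercepts` must be sorted ascending (B_1 < B_2 < ...)."""
    return sum(cit_pct > A * if_pct + b for b in intercepts)

def calibrate_intercepts(world_points, A, shares=(0.10, 0.30, 0.50, 0.80)):
    """Pick each B_n as a quantile of the residual CIT - A*IF so that,
    over the world distribution, the share of articles above each line
    matches the VQR call (top 10%, top 30%, top 50%, top 80%)."""
    residuals = sorted(cit - A * ifp for ifp, cit in world_points)
    n = len(residuals)
    # the line with a share s of articles above it sits at the (1-s) quantile
    return [residuals[min(round((1 - s) * n), n - 1)] for s in reversed(shares)]
```

The calibration step reflects the slide: the experts fix A, and the intercepts follow mechanically from the world distribution of the subject category.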
The bibliometric algorithm
There may be borderline cases if:
- articles published in high-prestige journals are scarcely cited
- articles published in low-impact journals have a high citation impact (shaded areas in the figure)
In such cases, the article is sent to peer review
All articles published in 2014 that do not fall in the Excellent class are generally evaluated in peer review
The expert panels also considered the role of self-citations: if they exceed 50% of the total number of citations, the paper is carefully examined by the expert and possibly sent to peer review
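The routing rules can be summarized in a small predicate. The `gap` threshold defining a borderline article is a placeholder of ours (the actual shaded areas were defined by the expert groups), and in the real exercise a high self-citation share triggered expert scrutiny rather than automatic re-routing:

```python
def needs_peer_review(if_pct, cit_pct, self_cit_share, year, biblio_class,
                      gap=0.4):
    """True if the article should go to peer review rather than being
    scored bibliometrically. `gap` is an illustrative threshold for the
    borderline (shaded) areas, not the one used in the VQR."""
    borderline = abs(if_pct - cit_pct) > gap       # high IF / few cites, or vice versa
    many_self_citations = self_cit_share > 0.5     # flagged for expert scrutiny
    too_recent = year == 2014 and biblio_class != "Excellent"
    return borderline or many_self_citations or too_recent
```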
Different values of A give different weights to citations or to journal impact: if A>1, journal impact is more important than citations, and vice versa
Generally speaking, the value of A has been chosen such that A<1; however, its value is generally higher for recent publications
[Figure: two partitions of Q, with IF on the X axis and CIT on the Y axis; for «old articles» citations are more important, for «recent articles» journal impact is more important]
Example: Physics (GEV 2), year 2007, m=0.5

Ventile | IF    | Citations
0       | 0.314 | 0
5       | 0.653 | 0
10      | 0.712 | 1
15      | 1.025 | 1
20      | 1.747 | 2
25      | 1.780 | 2
30      | 2.026 | 3
35      | 2.026 | 3
40      | 2.132 | 4
45      | 2.325 | 5
50      | 2.325 | 6
55      | 2.325 | 6
60      | 2.483 | 7
65      | 2.483 | 9
70      | 2.483 | 10
75      | 2.483 | 12
80      | 2.483 | 14
85      | 2.483 | 16
90      | 2.483 | 20
95      | 3.070 | 28
100     | 9.471 | 198
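Reading the table as cumulative thresholds (our interpretation: each row gives the IF and citation values reached at that ventile of the 2007 Physics world distribution), an article can be placed programmatically:

```python
from bisect import bisect_left

# Ventile thresholds for Physics (GEV 2), year 2007, from the table above.
VENTILES = list(range(0, 101, 5))
IF_THRESH = [0.314, 0.653, 0.712, 1.025, 1.747, 1.78, 2.026, 2.026, 2.132,
             2.325, 2.325, 2.325, 2.483, 2.483, 2.483, 2.483, 2.483, 2.483,
             2.483, 3.07, 9.471]
CIT_THRESH = [0, 0, 1, 1, 2, 2, 3, 3, 4, 5, 6, 6, 7, 9, 10, 12, 14, 16,
              20, 28, 198]

def ventile(value, thresholds):
    """Smallest ventile whose threshold reaches `value` (100 if none does)."""
    return VENTILES[min(bisect_left(thresholds, value), 20)]
```

For instance, a 2007 Physics article in a journal with IF 2.0 sits around the 30th ventile of the journal distribution, while 6 citations correspond to roughly the median of the citation distribution.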
How informed peer review works in practice
1. The article receives a bibliometric evaluation
2. Is informed review needed?
   - NO: the experts evaluate the bibliometric results
   - YES: the article is sent to peer review (Expert #1 and Expert #2), followed by a consensus group
3. Approval by the Coordinator
4. Final approval
If the experts change the bibliometric evaluation, they have to justify it
Effects of the informed peer review

GEV   | # articles | # modified by experts | %
1     | 5,905      | 529   | 8.96
2     | 10,595     | 469   | 4.43
3     | 7,023      | 376   | 5.35
4     | 4,402      | 527   | 11.97
5     | 10,941     | 507   | 4.63
6     | 17,173     | 396   | 2.31
7     | 7,496      | 416   | 5.55
8a    | 3,507      | 597   | 17.02
8b    | 2,806      | 225   | 8.02
9     | 11,447     | 1,568 | 13.70
10    | 8,727      | 617   | 7.07
11a   | 5,948      | 736   | 12.37
11b   | 2,289      | 261   | 11.40
12    | 8,495      | 726   | 8.55
13    | 8,302      | 383   | 4.61
14    | 2,980      | 322   | 10.81
Total | 118,036    | 8,655 | 7.33
Effects of bibliometrics used in PRFS
The results of the second VQR were published less than a month ago, so the analysis of the possible effects on the Italian system has not started yet
However, we can point out some preliminary evidence, looking both at details of the data gathered in performing the exercise and at the general performance of the Italian university system in the international scenario
Concerning the first point, comparing the data of the new VQR with the first one, we noticed:
- an increase in journal articles as a means of disseminating knowledge, both in areas evaluated with bibliometric indicators and in those where we mainly used peer review
- an increase in the use of the English language, especially in HSS (English was already largely the predominant language in STEM)
- some evidence of convergence in evaluation results between the north and the south of the country
- on the other hand, the persistence of strong performance differentials between the two geographical areas
The following graph presents the distribution of evaluation results for Italian universities in the first and second VQR
Results are standardized so that the national average equals zero, and are also corrected for size
We can observe a reduction in the standard deviation of the distribution, which is confirmed by an F test on the variance ratio
However, the next slide shows that the universities gaining from the evaluation exercise with respect to a distribution of funding purely based on size (blue dots) are still mostly concentrated in the north of the country, while red dots (universities losing with respect to a size-based allocation) are mostly in the south
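The variance comparison can be sketched with the standard F statistic (a sketch only: judging significance requires F critical values, e.g. from statistical tables or scipy.stats, which we omit here):

```python
from statistics import variance

def variance_ratio_F(scores_a, scores_b):
    """F statistic comparing the dispersion of two distributions of
    standardized university scores (larger variance on top, so F >= 1)."""
    va, vb = variance(scores_a), variance(scores_b)
    return max(va, vb) / min(va, vb)
```

A value close to 1 indicates similar dispersion across the two exercises; a significantly large F supports the reported reduction in the standard deviation between the first and second VQR.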
[Figures: distribution of standardized evaluation results in the first and second VQR, and map of universities gaining (blue dots) or losing (red dots) with respect to a size-based allocation of funding]
Isolating the effect of PRFS on productivity and production quality is not an easy task: it is very difficult to disentangle the impact of PRFS from the general trends internationally observed in terms of production and impact
However, we can observe that over the last 15 years Italy has indeed increased its role in terms of scientific production indexed in the major international databases
Citation impact is also growing, as measured by the Field-Weighted Citation Impact
The share of Italian articles in the top 10% of the world distribution in terms of citations is also growing
Open issues
Some issues have nevertheless emerged from the analysis, and they should be addressed in view of the next evaluation exercise, which according to Italian law will take place in 2020 with reference to the period 2015-2019
- Currently, researchers operating in different scientific areas have to submit the same number of articles for evaluation (two each); however, there are huge differences in average productivity and in the average number of authors across areas, and this may be taken into account in the design of the next PRFS
- Similarly to what has already been mentioned for Norway, the results are also used locally, in some cases in contexts that are not fully appropriate; this may be followed up by establishing guidelines for proper managerial use of the results at the local level
- The Italian system still lacks an official CRIS containing all the research products of Italian researchers; the development of such a system would greatly enhance and simplify the functioning of national research evaluation exercises