MSc Projects - Information Searching
Peter Hancox, Computer Science
Why should you be searching?
Information searching/retrieval is about:
- saving you time by finding ways to solve problems, produce better designs, discover problem domains and benchmark your work;
- learning from other people's work;
- developing problem-solving skills;
- keeping your examiners happy.
Introduction to information retrieval 1
IR is more than Google
Most people's way of finding information is to use Google.
- What were the surnames of my grandfathers?
- Is the Earth flat? See: http://www.alaska.net/~clund/e_djublonskopf/flatearthsociety.htm
Google is great for some things - mainly finding undisputed facts.
Google relies on indexing WWW pages, so it is only as complete and as accurate as the WWW.
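The Computer Science literature
[Pie chart of the literature by document type: conference papers, journal articles, technical reports, theses, books, other material (e.g. programs), WWW pages. Segment values: 3%, 7%, 17%, 39%, 3%, 5%, 26%; the pairing of values with types did not survive extraction.]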
How to know you're getting quality
How do you, a novice, know you're reading high-quality scientific/technical literature?
- Journal papers - the best are peer-reviewed.
- Conference papers - the best are peer-reviewed.
- Books - the best are published by the best publishers, e.g. Oxford, Cambridge, MIT.
- Technical reports - the best probably come from the best universities and companies.
But how do you know which are the best?
How to know you're getting quality
The answer is very simple: use specialised information retrieval databases that have excellent coverage and excellent currency.
Some IR theory and practice - 1
There are three kinds of search:
- finding simple facts - use Google (with care);
- current awareness - keeping yourself up to date;
- retrospective searching - finding some (or all) of the literature on a topic.
This lecture is mainly about retrospective searching.
Some IR theory and practice - 2
A document set can be divided into relevant and irrelevant documents.
Some IR theory and practice - 3
Precision = (no. of relevant documents retrieved) / (total no. of documents retrieved)
Example: 100/160 = 62.5%
Some IR theory and practice - 4
Recall = (no. of relevant documents retrieved) / (total no. of relevant documents in the set)
Example: 100/200 = 50%
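The two measures can be checked with a few lines of Python. This is only a minimal sketch using the figures from the slides (160 documents retrieved, 100 of them relevant, 200 relevant documents in the whole set); the function and parameter names are illustrative assumptions, not part of the lecture.

```python
def precision(relevant_retrieved: int, total_retrieved: int) -> float:
    """Fraction of the retrieved documents that are relevant."""
    return relevant_retrieved / total_retrieved


def recall(relevant_retrieved: int, total_relevant: int) -> float:
    """Fraction of all relevant documents that were actually retrieved."""
    return relevant_retrieved / total_relevant


# Figures from the slides: 160 retrieved, 100 of those relevant,
# 200 relevant documents in the whole document set.
p = precision(100, 160)
r = recall(100, 200)

print(f"precision = {p:.1%}")  # precision = 62.5%
print(f"recall = {r:.1%}")     # recall = 50.0%
```

Both measures share the same numerator (relevant documents retrieved); only the denominator changes, which is why a search can score well on one and badly on the other.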
Some IR theory and practice - 5
The paradox of searching?
It seems impossible to get both 100% precision and 100% recall from the same search.
Some IR theory and practice - 6
Bradford's law of scattering:
- Colloquially: to find all the relevant scientific literature on a topic, you have to look in all the literature; to find ~90% of it, you only have to look in 10% of the literature.
- More formally: the returns from extending a search for references in science journals diminish exponentially.
Some IR theory and practice - 7
Bradford's law of scattering means that we can concentrate searching on a fairly small subset of the literature and still get most of the results.
Specialised information retrieval databases are designed to retrieve large amounts of literature from the optimum number of journals.
Google isn't designed to do this.
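The colloquial form of Bradford's law can be illustrated with a small sketch. The journal counts below are invented purely for illustration (a few core journals with many relevant articles and a long tail with one each); they are not real data, and the function is not Bradford's formal zone formulation, only a way of seeing the diminishing returns.

```python
def journals_needed(articles_per_journal, target_percent):
    """Smallest number of top-ranked journals that together hold at least
    target_percent of all the relevant articles."""
    ranked = sorted(articles_per_journal, reverse=True)
    total = sum(ranked)
    covered = 0
    for rank, count in enumerate(ranked, start=1):
        covered += count
        # Integer comparison avoids floating-point rounding at the threshold.
        if covered * 100 >= target_percent * total:
            return rank
    return len(ranked)


# Hypothetical distribution: 4 core journals plus 25 journals that each
# carry a single relevant article (400 relevant articles in total).
counts = [200, 100, 50, 25] + [1] * 25

for percent in (90, 100):
    n = journals_needed(counts, percent)
    print(f"{percent}% of the articles: {n} of {len(counts)} journals")
```

With these invented numbers, 4 of the 29 journals already cover more than 90% of the articles, while reaching 100% means searching all 29.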
Choosing databases - books
Don't use Amazon - it only has books currently on sale that it can source.
Use a copyright deposit library:
- British Library
- Library of Congress
- Cambridge UL
Databases - books 13
Choosing databases - journals and conference papers
The best keyword-based Computer Science services are:
- Inspec: http://www.engineeringvillage2.org
- ACM Guide to Computing Literature: http://portal.acm.org/guide.cfm
Databases - journals and conference papers 14
Choosing databases - journals and conference papers
Interdisciplinary services with substantial Computing coverage:
- Medline: http://gateway.ovid.com/autologin.html
- Compendex/Engineering Index: http://www.engineeringvillage2.org
Choosing databases - journals and conference papers
Single-publisher services, perhaps with full-text access:
- IEEE Xplore: http://ieeexplore.ieee.org/
Inspec - coverage and currency
Includes:
- 3,500 journals, many of them computing science and applications journals;
- conference papers, with 1,500 conferences added each year;
- seemingly also reports, theses, etc., but how satisfactory is the coverage?
Journals seem to be completely indexed within ~6 months of publication.
Searching Inspec 17
Inspec - indexing
How are the entries indexed?
- Classification scheme
- Controlled language
- Keywords: taken from the title, taken from the abstract, or written by the indexer
Inspec - indexing
[Screenshot of an Inspec record; title-field (ti:) terms shown: coherence, inference, representation.]
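As a rough sketch of what field-based indexing makes possible, the Python below restricts a keyword search to one field of a record. The records, the field codes other than `ti`, and the function names are all hypothetical; this is not Inspec's actual record format or query syntax.

```python
# Hypothetical records. "ti" = title, "ab" = abstract, "ct" = controlled terms
# (the "ab" and "ct" codes are assumptions made for this example).
records = [
    {"ti": "Coherence and inference in knowledge representation",
     "ab": "A study of reasoning over structured knowledge",
     "ct": ["knowledge representation", "inference mechanisms"]},
    {"ti": "Inference engines for expert systems",
     "ab": "Discusses coherence of rule bases",
     "ct": ["expert systems"]},
    {"ti": "Parsing natural language",
     "ab": "On the representation of syntax",
     "ct": ["natural languages"]},
]


def matches(record, field, term):
    """True if term occurs as a whole word in the given field."""
    value = record.get(field, "")
    text = " ".join(value) if isinstance(value, list) else value
    return term.lower() in text.lower().split()


def search(records, field, terms):
    """Records whose field contains every one of the terms (an AND search)."""
    return [r for r in records if all(matches(r, field, t) for t in terms)]


title_hits = search(records, "ti", ["coherence", "inference", "representation"])
abstract_hits = search(records, "ab", ["coherence"])
print([r["ti"] for r in title_hits])
print([r["ti"] for r in abstract_hits])
```

Restricting the search to the title field returns only the first record, whereas the same word searched in the abstract field finds a different one; choosing the field is one way of trading recall against precision.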
Inspec - searching
Demonstration based on handout.
Science Citation Index - Subject coverage
The scope is so wide as to be multidisciplinary.
It indexes:
- journals - almost 5,300 science journals, including at least 200 computing journals and probably more.
It doesn't directly index:
- conferences
- books
- reports
- theses
By comparison, Inspec indexes 3,500 mainly relevant journals.
Searching Science Citation Index 22
Science Citation Index - Comprehensiveness/coverage
- Covers many of the principal journals in computing.
- Has wide computer science coverage, choosing the most widely respected journals rather than having (e.g.) an engineering bias.
Science Citation Index - Subject overlap
SCI overlaps with several other indexing services:
- Compendex has many of the same core journals, but also has conferences.
- Inspec has many of the same core journals and lots of other journals, and also has conferences.
Science Citation Index - Record content
How much information do the entries contain?
- Basic bibliographic information
- Abstract
- Institution, e.g. University of Birmingham
- Language of original article
Science Citation Index - Indexing
How are the entries indexed?
- Keywords: taken from the title, taken from the abstract, or written by the indexer
- Citations
SCI - searching
Demonstration based on handout.
So what does SCI retrieve?
If you use it as a keyword-based index:
- only journals/serials.
If you search for citations:
- anything that authors cite: journal and conference papers, books, theses, technical reports, letters, WWW pages, newspapers, conversations.
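The difference between the two kinds of search can be sketched as an inverted index from cited items to the articles that cite them. The article names and reference lists below are hypothetical, and the code only illustrates the general idea of a citation index; it does not describe how SCI is actually built.

```python
from collections import defaultdict

# Hypothetical indexed journal articles and the items each one cites.
references = {
    "Journal article A": ["Book X", "Conference paper Y"],
    "Journal article B": ["Book X", "Technical report Z"],
    "Journal article C": ["Journal article A", "Conference paper Y"],
}


def build_citation_index(references):
    """Invert the reference lists: map each cited item to its citing articles."""
    cited_by = defaultdict(list)
    for citing, cited_items in references.items():
        for cited in cited_items:
            cited_by[cited].append(citing)
    return dict(cited_by)


index = build_citation_index(references)

# "Book X" is not an indexed article, yet a citation search still finds it.
print(index["Book X"])  # ['Journal article A', 'Journal article B']
```

Only the journal articles are indexed directly, but the books, conference papers and reports they cite become searchable through the inverted index.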
Searching SCI for citations
Points to think about:
- Does the use of citations improve recall and/or precision?
- What criteria are used to include cited items?
- Are items cited because they are relevant? Because the author wrote them? To criticize an alternative approach? To impress readers with the author's erudition?
The End 30