ENCYCLOPEDIA DATABASE

Step 1: Select encyclopedias and articles for digitization
The encyclopedias in the database date mainly from the 19th and 20th centuries. Currently, we include encyclopedic works in the following languages: English, German, French, Russian, Chinese, Japanese and Korean. The copyright status of each encyclopedia has to be checked. We contact and negotiate with publishers, try to locate individual authors, search libraries for copyright-free works, and ask large libraries and private organizations for permission to use their online materials. From each encyclopedia, we digitize the entries that match a list of 18 keywords, as well as the paratexts provided in the work itself (e.g. title page, prefaces, reading instructions, tables of contents, indices). Because encyclopedias are mostly compiled to provide a comprehensive overview of contemporary knowledge on a certain topic, researchers can easily explore the regional and historical dimensions of particular concepts and of encyclopedic ways of storing this knowledge.

Step 2: Scan selected articles and paratexts
We scan the selected articles and paratexts manually. In places like the Stadtarchiv Heidelberg or some smaller libraries, only copy machines are available; the copies are then scanned at our institute. During scanning, the books, which are often more than 100 years old, have to be handled with care. For some books, the holding library requires special permission for scanning or copying. The physical condition of each book is of special interest for the project. In the future, we plan to have the books analyzed by criminologists, who use chemical methods to investigate the history of a book. For the time being, because the books under consideration are rare and precious, we scan them very carefully and record notes on their condition in the digital text.

Step 3: Optical Character Recognition or manual typing
Scanned pages are initially mere image files. To extract the text, OCR (Optical Character Recognition) software has to be used. While this works well for scans of good prints in European scripts, it does not work well for Asian texts with complex characters or for handwritten texts. For Chinese and Japanese texts, we send our digital files to a company located in Beijing, where each character is typed manually. In addition to digitizing the characters, the company does a first tagging of running heads, page breaks, headings, figures, paragraphs, highlighting, old character variants, etc. The texts nevertheless need further processing with scripts. This is done in a standardized way in cooperation with the Max Planck Institute for the History of Science, Berlin.

Step 4: Proofreading and basic TEI mark-up
Automatic text recognition alone does not produce a text that can be inserted into the database. Every single text has to be proofread, and the time needed for this step depends very much on the quality of the scan and on the correct settings of the OCR programme. Basic mark-up has to be inserted according to a standard, which makes digital texts interchangeable between different projects and platforms. In the Encyclopedia Database, the standard mark-up of the Text Encoding Initiative (TEI) is used to store this information in the XML document. It covers basic mark-up for paragraphs, highlighting, bibliographies, lists, and encyclopedia entries with one head and one article, as well as project-specific rules for storing information such as the association of each article with certain keywords.
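As an illustration of the basic mark-up described above, the following Python sketch parses a small TEI-style fragment containing one entry with one head and one article. The element names `div`, `head`, `p` and `hi` are standard TEI; the `type="entry"` value, the use of the `ana` attribute to carry the keyword association, and the sample text are assumptions made for this illustration, not the project's actual encoding rules.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI_NS}

# Hypothetical sample; attribute values and text are illustrative assumptions.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <div type="entry" ana="#keyword-encyclopedia #keyword-knowledge">
        <head>Encyclopedia</head>
        <p>A work that gives a <hi rend="italic">comprehensive</hi> overview of knowledge.</p>
      </div>
    </body>
  </text>
</TEI>"""


def read_entries(xml_text):
    """Return one dict per entry: its head, plain article text and keywords."""
    root = ET.fromstring(xml_text)
    entries = []
    for div in root.iterfind(".//tei:div[@type='entry']", NS):
        head = div.find("tei:head", NS)
        # Flatten inline mark-up such as <hi> and normalize whitespace.
        paragraphs = [
            " ".join("".join(p.itertext()).split())
            for p in div.iterfind("tei:p", NS)
        ]
        # Assumed convention: keywords are '#'-prefixed pointers in @ana.
        keywords = [ref.lstrip("#") for ref in div.get("ana", "").split()]
        entries.append({
            "head": head.text.strip() if head is not None and head.text else "",
            "article": "\n".join(paragraphs),
            "keywords": keywords,
        })
    return entries


entries = read_entries(SAMPLE)
print(entries[0]["head"], entries[0]["keywords"])
```

Keeping the keyword association in an attribute on the entry element means it travels with the entry when the XML is exchanged between projects; a real encoding might store it elsewhere.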

Step 5: Content TEI mark-up and linking of contents to other resources
To enable inter-lingual comparison and analysis, the contents of the texts can be marked up and linked to resources inside and outside the database. This is done by tagging according to a list of things or concepts in which the researchers are interested. Currently, we mark up persons, peoples, places, events, time periods, institutions, religions, and languages. As manual tagging is a huge amount of work, we are now setting up a cooperation with the Department of Computational Linguistics at Heidelberg University. The aim is to develop methods for structured automatic harvesting and tagging of contents. As a multi-lingual resource, we currently use DBpedia.org, a structured information base derived from data in Wikipedia. We are looking for ways to link our contents to YAGO2, a huge semantic knowledge base derived from Wikipedia, WordNet and GeoNames and developed by the Max Planck Institute for Informatics, Saarbrücken.
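One simple way to approach automatic tagging is a gazetteer lookup: known names are matched in the text and wrapped in a TEI name element whose `ref` attribute points to an external resource. The sketch below is only an illustration under stated assumptions. The gazetteer contents, the mapping from categories to `persName` and `placeName`, and the function name `tag_names` are hypothetical, and this is not the method being developed with the Department of Computational Linguistics. It also assumes plain input text without XML special characters, and the word-boundary matching it relies on does not suit Chinese or Japanese text.

```python
import re

# Hypothetical gazetteer: surface form -> (category, DBpedia URI).
GAZETTEER = {
    "Heidelberg": ("place", "http://dbpedia.org/resource/Heidelberg"),
    "Immanuel Kant": ("person", "http://dbpedia.org/resource/Immanuel_Kant"),
}

# TEI element used per category (an assumed mapping for this sketch).
ELEMENT_FOR = {"person": "persName", "place": "placeName"}


def tag_names(text, gazetteer=GAZETTEER):
    """Wrap every gazetteer match in a TEI name element with a ref attribute."""
    if not gazetteer:
        return text
    # Longest names first so that multi-word names win over their parts.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b"
    )

    def wrap(match):
        category, uri = gazetteer[match.group(1)]
        element = ELEMENT_FOR[category]
        return f'<{element} ref="{uri}">{match.group(1)}</{element}>'

    return pattern.sub(wrap, text)


tagged = tag_names("The article mentions Immanuel Kant and Heidelberg.")
print(tagged)
```

A lookup of this kind cannot resolve ambiguous names, so its output would still need the manual checking described above.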

Step 6: Database design and programming
The database was designed in close connection with the building of the text corpus. As the database grows together with the data provided in the XML files and with the requested functions, close contact with the programmers of existsolutions GmbH has been and remains very important. The functions of the database are:
- Browsing through different encyclopedias, entries, name tags and keywords.
- Searching for entries, names and terms, keywords, and full text.
- Viewing single entries or paratexts, accompanied by a list of the names and terms tagged in the text.
- Display of, and inter-database links to, metadata and analyses of the encyclopedic works.
- To be implemented soon: display of images and original pages, and an extended comment function.
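The browsing, searching and viewing functions listed above can be sketched in a few lines of Python over an in-memory collection. This is an illustration only and does not reflect how the actual database is implemented; the class names `Entry` and `EntryCollection`, their fields, and the sample data are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    encyclopedia: str
    head: str
    text: str
    keywords: set = field(default_factory=set)
    names: set = field(default_factory=set)  # tagged names and terms


class EntryCollection:
    def __init__(self, entries):
        self.entries = list(entries)

    def browse(self, encyclopedia=None, keyword=None, name=None):
        """Return the sorted heads of entries matching all given filters."""
        hits = [
            e for e in self.entries
            if (encyclopedia is None or e.encyclopedia == encyclopedia)
            and (keyword is None or keyword in e.keywords)
            and (name is None or name in e.names)
        ]
        return sorted(e.head for e in hits)

    def search(self, term):
        """Case-insensitive substring search over heads and full text."""
        needle = term.casefold()
        return [
            e for e in self.entries
            if needle in e.head.casefold() or needle in e.text.casefold()
        ]

    def view(self, encyclopedia, head):
        """Return one entry's text with the sorted list of its tagged names."""
        for e in self.entries:
            if e.encyclopedia == encyclopedia and e.head == head:
                return e.text, sorted(e.names)
        raise KeyError((encyclopedia, head))


# Hypothetical sample data.
collection = EntryCollection([
    Entry("Work A", "Encyclopedia", "A book of general knowledge.",
          {"encyclopedia"}, {"Diderot"}),
    Entry("Work B", "Knowledge", "What an encyclopedia tries to store.",
          {"knowledge"}, set()),
])
print(collection.browse(keyword="knowledge"))
print([e.head for e in collection.search("encyclopedia")])
```

A linear scan is enough for a sketch; a production system would rely on the indexes of its database engine instead.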

Step 7: Metadata, analyses, summaries, partial translations, notes, etc.
Metadata helps researchers to get an idea of what kind of work they are dealing with. In the database, metadata is given in three parts:
1) File Description: bibliographic data such as author, editor, title, publisher, year, physical description, etc.
2) Profile Description: language, publication circumstances, persons involved, genre and style.
3) Analysis: translations and analyses of prefaces, summaries of single articles, advertisements in and for the book, analyses of readership and of the hidden grammars behind the works, and secondary literature.
The database is designed to be multi-lingual and interdisciplinary. However, the different languages may be an obstacle for a researcher working on a transcultural history of concepts and knowledge. We therefore decided to provide English summaries or translations of key passages, especially the prefaces.
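The three-part metadata structure described above can be modelled as a small set of Python data classes. This is a minimal sketch: the class and field names follow the list in the text, but the exact fields, the language codes, the helper `needs_english_preface` and the sample record are assumptions made for illustration.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class FileDescription:
    title: str
    author: str = ""
    editor: str = ""
    publisher: str = ""
    year: str = ""
    physical_description: str = ""


@dataclass
class ProfileDescription:
    language: str  # assumed to be a short language code such as "zh" or "en"
    publication_circumstances: str = ""
    persons_involved: list = field(default_factory=list)
    genre: str = ""
    style: str = ""


@dataclass
class Analysis:
    preface_translations: list = field(default_factory=list)
    article_summaries: list = field(default_factory=list)
    advertisements: list = field(default_factory=list)
    readership_notes: list = field(default_factory=list)
    secondary_literature: list = field(default_factory=list)


@dataclass
class WorkMetadata:
    file_description: FileDescription
    profile_description: ProfileDescription
    analysis: Analysis = field(default_factory=Analysis)

    def needs_english_preface(self):
        """True if the work is not in English and has no preface translation."""
        return (
            self.profile_description.language != "en"
            and not self.analysis.preface_translations
        )


# Hypothetical record, not taken from the database.
record = WorkMetadata(
    FileDescription(title="Example Encyclopedia", year="1900"),
    ProfileDescription(language="zh", genre="general encyclopedia"),
)
print(record.needs_english_preface())
print(sorted(asdict(record)))
```

A check like `needs_english_preface` could be used to list the works that still lack the English summaries or translations mentioned above.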

Step 8: Processing and linking of images and page scans
The database will soon be able to show the scanned original page next to the tagged full text. This will allow users to combine research on the original material with the convenience of full-text search and comparison. This feature will also support corrections and document-specific tagging of the text when new data is entered into the XML files for the database. Another starting point for research in the project is the analysis of the images that accompany the encyclopedic text. Images and advertisements may hint at the intended readership of the books and give insight into the depth of understanding of, or the attitude towards, certain topics, in particular when ideas are imported from a foreign culture.

Step 9: Use and enrich in teaching and research all over the world
Courses in winter term 2011/12 and summer term 2012:
- Reading Exercise: Reading Chinese Encyclopedias and related documents (Dr. Wang, Institute of Chinese Studies)
- Seminar: Migration of Knowledge in Encyclopedias (Prof. Mittler, Prof. Herren, Prof. Judge, Cluster Asia and Europe)
Teaching and research are the main aims of the database. Researchers, teachers and students can make use of the large amount of digitized material and at the same time contribute translations, analyses and other notes on the texts and documents. Researchers from all over the world are invited to participate in this open-source, freely accessible database. Current and planned courses are held in close connection with the project and are closely integrated into the development process of the database. Cooperation partners include researchers in Germany, Switzerland, Austria, England, India, China, Japan, Canada and other places.

Step 10: Share data with partners in Digital Humanities
Within the Heidelberg Research Architecture (HRA) of the Cluster Asia and Europe in a Global Context at Heidelberg University, the Encyclopedia Database is developed in close cooperation with other projects that establish databases of textual and visual material. In the future, our project will not be limited to encyclopedias but is working towards integrating all kinds of digital texts, thus providing a database infrastructure with manifold functionality and interaction for other projects. Digital Humanities is a growing field of increasing importance. We actively keep up with new developments and make our results available to the international community of researchers. The database is not only open-source; the TEI standard was also chosen in order to make data interchangeable between different projects. We invite all non-commercial usage of our data, and encourage cooperation partners to exchange their materials with us.