Step 1: Select encyclopedias and articles for digitization
Encyclopedias in the database are mainly chosen from the 19th and 20th centuries. Currently, we include encyclopedic works in the following languages: English, German, French, Russian, Chinese, Japanese, and Korean. The copyright situation of each encyclopedia has to be checked: we contact and negotiate with publishers, try to trace individual authors, search library holdings for copyright-free works, and ask large libraries or private organizations for permission to use their online materials. Of each encyclopedia, we digitize entries according to a list of 18 keywords, as well as the paratexts provided in the work itself (e.g. title page, prefaces, reading instructions, tables of contents, indices, etc.). As encyclopedias are mostly compiled to provide a comprehensive overview of contemporary knowledge, researchers can easily browse the regional and historical dimensions of certain concepts and of the encyclopedic ways of storing this knowledge.
Step 2: Scan selected articles and paratexts
We manually scan the selected articles and paratexts. In places like the Stadtarchiv Heidelberg or some smaller libraries, only copy machines are available; the copies are then scanned at our institute. During scanning, the books, which are often more than 100 years old, have to be handled with great care, and some may only be scanned or copied with special permission from the holding library. The actual condition of a book is of special interest for the project. In the future, it is planned to have the books analyzed by criminologists, who employ chemical methods to reconstruct the history of a book. For the time being, and since the books under consideration are rare and precious, we scan them very carefully and record notes about their condition in the digital text.
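How such a condition note is encoded is not specified above; the following minimal sketch assumes a TEI-style <note type="condition"> element placed near the page break it refers to. The element and attribute names are illustrative assumptions, not the project's actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical sketch: a condition note attached to a digitized page.
# Element and attribute names are illustrative assumptions, not the
# project's actual TEI schema.
tei_fragment = """
<div type="entry">
  <pb n="132"/>
  <note type="condition">Upper right corner of p. 132 torn;
    last line of the entry partially illegible.</note>
  <p>Text of the encyclopedia entry ...</p>
</div>
"""

root = ET.fromstring(tei_fragment)
for note in root.iter("note"):
    if note.get("type") == "condition":
        print(" ".join(note.text.split()))
```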
Step 3: Optical Character Recognition or Manual Typing
Scanned pages are mere image files at first. To extract the text, OCR (Optical Character Recognition) software has to be employed. While this works fine for scans of good prints in European scripts, it does not work well for Asian texts with complex characters or for handwritten texts. For Chinese and Japanese texts, we therefore send our digital files to a company located in Beijing, where each character is typed manually. In addition to the character digitization, the company does a first tagging of running heads, page breaks, headings, figures, paragraphs, highlighting, old character variants, etc. The texts then need further processing with scripts; this is done in a standardized way in cooperation with the Max Planck Institute for the History of Science, Berlin.
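The OCR program used in the project is not named above. As an illustration only, a minimal sketch with the open-source Tesseract engine (via the pytesseract wrapper) might look as follows; the file name and language settings are assumptions.

```python
from PIL import Image     # pip install pillow
import pytesseract        # pip install pytesseract; needs the Tesseract engine

# Illustrative sketch only: the OCR software actually used in the project
# is not named above. File name and language settings are assumptions.
page = Image.open("scan_0042.png")
# 'deu' = modern German; older German prints in Fraktur may need the
# dedicated 'frk' model instead.
text = pytesseract.image_to_string(page, lang="deu")
print(text)
```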
Step 4: Proof-reading, basic TEI mark-up
Automatic text recognition alone is not sufficient to insert a text into the database. Every single text needs proofreading, and the time this step takes depends very much on the quality of the scan and on the correct settings of the OCR programme. Basic mark-up has to be inserted according to a common standard, which makes digital texts interchangeable between different projects and platforms. In the Encyclopedia Database, the standard mark-up of the Text Encoding Initiative (TEI) is used to store information in the XML documents. This includes basic mark-up such as paragraphs, highlighting, bibliographies, lists, and encyclopedia entries with one head and one article, as well as project-specific rules for storing information such as the association of each article with certain keywords.
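As a concrete illustration, a minimal sketch of such a basic TEI-style entry ("one head and one article") and of how a script might read it could look as follows. The ana="#..." attribute is a hypothetical stand-in for the project's own keyword-association rules, not the actual encoding.

```python
import xml.etree.ElementTree as ET

# Minimal sketch of a basic TEI-style entry: one <head>, one article body.
# The ana="#internationalLaw" keyword association is a hypothetical
# stand-in for the project-specific rules mentioned above.
entry_xml = """
<div type="entry" ana="#internationalLaw">
  <head>International Law</head>
  <p>International law is the body of rules <hi rend="italic">generally</hi>
     accepted as binding between states ...</p>
</div>
"""

entry = ET.fromstring(entry_xml)
head = entry.find("head").text
keywords = entry.get("ana")
print(f"Entry: {head!r}, keywords: {keywords}")
```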
Step 5: Content TEI mark-up and linking of contents to other resources
To enable inter-lingual comparison and analysis, the contents of the texts can be marked up and linked to resources inside and outside the database. This is done by tagging according to a list of entities and concepts in which the researchers are interested. Currently, we mark up persons, peoples, places, events, time periods, institutions, religions, and languages. As manual tagging is a huge amount of work, we are now setting up a cooperation with the Department of Computational Linguistics at Heidelberg University, with the aim of developing methods for structured automatic harvesting and tagging of contents. As a multi-lingual resource, we currently employ DBpedia.org, a structured information base derived from data in Wikipedia. We are also looking for ways to link our contents to YAGO2, a large semantic knowledge base derived from Wikipedia, WordNet and GeoNames, developed by the Max-Planck-Institut für Informatik, Saarbrücken.
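A tagged name can then carry a reference to an external resource. The sketch below assumes a linking convention of the form <placeName ref="http://dbpedia.org/resource/..."> (an illustration, not the project's actual encoding) and retrieves the English label for such a linked resource from DBpedia's public SPARQL endpoint.

```python
import requests

# Hypothetical linking convention:
#   <placeName ref="http://dbpedia.org/resource/Heidelberg">Heidelberg</placeName>
# Look up the English label of the linked resource on DBpedia's
# public SPARQL endpoint.
uri = "http://dbpedia.org/resource/Heidelberg"
query = f"""
SELECT ?label WHERE {{
  <{uri}> rdfs:label ?label .
  FILTER (lang(?label) = "en")
}}
"""
resp = requests.get(
    "https://dbpedia.org/sparql",
    params={"query": query, "format": "application/sparql-results+json"},
    timeout=30,
)
for row in resp.json()["results"]["bindings"]:
    print(row["label"]["value"])   # -> "Heidelberg"
```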
Step 6: Database design and programming
The database was designed in a process closely connected to the building of the text corpus. As the database grows together with the data provided in the XML files and the functions requested, close contact with the programmers of existsolutions GmbH was and still is very important. Functions of the database include:
- Browsing through the different encyclopedias, entries, name tags, and keywords.
- Searching for entries, names and terms, keywords, and full text.
- Viewing single entries or paratexts, accompanied by a list of the names and terms tagged in the text.
- Display of, and inter-database links to, metadata and analyses of the encyclopedic works.
- Soon to be implemented: display of images and original page scans, and an extended comment function.
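For illustration: assuming the XML documents are hosted in a native XML database with a REST interface (such as eXist-db, whose REST API exposes a _query parameter), a search across entries might be issued as sketched below. The collection path and element names are hypothetical, not the project's actual layout.

```python
import requests

# Hypothetical sketch: query a native XML database (eXist-db style REST API).
# The collection path and element names are assumptions.
xquery = """
for $e in collection('/db/encyclopedias')//div[@type='entry']
where contains(lower-case(string($e/head)), 'law')
return string($e/head)
"""
resp = requests.get(
    "http://localhost:8080/exist/rest/db/encyclopedias",  # assumed local instance
    params={"_query": xquery},
    timeout=30,
)
print(resp.text)   # eXist wraps the matching heads in an XML result document
```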
Step 7: Metadata, analyses, summaries, partial translations, notes, etc.
Metadata helps researchers get an idea of what kind of work they are dealing with. In the database, metadata is given in three parts:
1) File Description: bibliographic data such as author, editor, title, publisher, year, physical description, etc.
2) Profile Description: language, publication circumstances, persons involved, genre and style.
3) Analysis: translations and analyses of prefaces, summaries of single articles, advertisements in and for the book, analyses of readership and of the hidden grammars behind the works, and secondary literature.
The database is designed to be multi-lingual and interdisciplinary. However, the different languages may be an obstacle to a researcher of a transcultural history of concepts and knowledge. We therefore decided to provide English summaries or translations of key passages, especially the prefaces.
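The first two parts correspond naturally to the standard TEI header, where <fileDesc> carries the bibliographic data and <profileDesc> the language and publication circumstances; the sketch below shows a minimal example with placeholder content (where the project stores the analyses of part 3 is not specified above).

```python
import xml.etree.ElementTree as ET

# Minimal sketch of a TEI header carrying metadata parts 1) and 2).
# <fileDesc> and <profileDesc> are standard TEI elements; the content
# here is placeholder text for illustration.
header_xml = """
<teiHeader>
  <fileDesc>
    <titleStmt><title>A Sample Encyclopedia</title></titleStmt>
    <publicationStmt><p>Publisher, place, year ...</p></publicationStmt>
    <sourceDesc><p>Digitized from the copy held at ...</p></sourceDesc>
  </fileDesc>
  <profileDesc>
    <langUsage><language ident="zh">Chinese</language></langUsage>
  </profileDesc>
</teiHeader>
"""

header = ET.fromstring(header_xml)
print(header.find("./fileDesc/titleStmt/title").text)  # -> "A Sample Encyclopedia"
```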
Step 8: Processing and linking of images and page scans
The database will soon be able to show the scanned original page next to the tagged full text. This will allow users to combine research on the original material with the convenience of full-text search and comparison. Furthermore, this feature will help with corrections and document-specific tagging of the text while new data is entered into the XML files. Another starting point for research in the project is the analysis of the images that accompany the encyclopedic texts. Images, and also advertisements, may give hints about the intended readership of the books and insight into the depth of understanding of, or the attitude towards, certain topics, in particular where ideas were imported from a foreign culture.
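TEI provides a standard mechanism for this kind of text-image linking: a <facsimile> section lists the page images, and page-break elements in the text point at them via the facs attribute. A minimal sketch follows; the file names and xml:id values are assumptions.

```python
import xml.etree.ElementTree as ET

# Minimal sketch of standard TEI facsimile linking: <pb facs="..."/> in the
# text points at a <surface>/<graphic> in the <facsimile> section.
# File names and xml:ids are assumptions.
doc_xml = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <facsimile>
    <surface xml:id="f132"><graphic url="scans/page_132.jpg"/></surface>
  </facsimile>
  <text><body>
    <pb n="132" facs="#f132"/>
    <p>Entry text printed on page 132 ...</p>
  </body></text>
</TEI>
"""

ns = {"tei": "http://www.tei-c.org/ns/1.0"}
doc = ET.fromstring(doc_xml)
pb = doc.find(".//tei:pb", ns)
print(pb.get("n"), "->", pb.get("facs"))   # -> 132 -> #f132
```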
Step 9: Use and enrich in teaching and research all over the world
Teaching and research are the main aims of the database. Researchers, teachers and students can make use of the huge amount of digitized material and at the same time contribute translations, analyses and other notes on the texts and documents. Researchers from all over the world are invited to participate in this open-source, fully access-free database. Current and planned courses are held in close connection with the project and are closely integrated into the development process of the database:
- Winter term 2011/12: Reading Exercise: Reading Chinese Encyclopedias and Related Documents (Dr. Wang, Institute of Chinese Studies)
- Summer term 2012: Seminar: Migration of Knowledge in Encyclopedias (Prof. Mittler, Prof. Herren, Prof. Judge, Cluster Asia and Europe)
Cooperation partners include researchers in Germany, Switzerland, Austria, England, India, China, Japan, Canada and other places.
Step 10: Share data with partners in Digital Humanities
Inside the Heidelberg Research Architecture (HRA) of the Cluster Asia and Europe in a Global Context at Heidelberg University, the Encyclopedia Database is developed in close cooperation with other projects that build databases of textual and visual material. Our project will not remain limited to encyclopedias: it is working towards integrating all kinds of digital texts, thus providing a database infrastructure with manifold functionality and interaction for other projects. Digital Humanities is a growing field of steadily increasing importance. We work to keep up with new developments and to make our results available to the international community of researchers. The database is not only open-source; the TEI standard was chosen precisely to make data interchangeable between different projects. We invite all non-commercial use of our data and encourage cooperation partners to exchange their materials with us.