OCR of Historical Printings of Latin Texts

OCR of Historical Printings of Latin Texts
Problems, Prospects,

Uwe Springmann (1), Dietmar Najock (2), Hermann Morgenroth (2), Helmut Schmid (1), Annette Gotscharek (1) and Florian Fink (1)

(1) CIS, Ludwig-Maximilians-Universität München
(2) Institute for Greek and Latin Languages and Literatures, Freie Universität Berlin

Overview
  Why Latin?
  Problems
  Prospects

Why Latin?
  huge heritage: largest body of historical literary sources
  Latin publications dominate print production until about 1750
  many titles have never been reprinted
  either key or barrier to cultural heritage of the western world
  has been left out of the IMPACT project despite its importance

Problems: Some problems for OCR engines
  historical fonts
  long s (ſ)
  historical ligatures: Æ, æ, Œ, œ, st
  polytonic Greek words
  diacritics
  abbreviations
  historical spellings

Problems: Some problems for OCR engines (continued)
  historical typography and spelling are also a problem for early modern languages
  ambiguities of abbreviations (especially in incunabula) will not immediately lead to fully expanded, machine-readable text
  but discretionary diacritics are helpful in POS/morphology disambiguation:
    adverb/vocative: altè/alte
    adverb/pronoun: quàm/quam
    conjunction/preposition: cùm/cum
    ablative/nominative: hastâ/hasta
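The four pairs above can be read as a small lookup from surface form to grammatical reading. The following is a minimal sketch under that assumption: the tables `MARKED` and `UNMARKED` contain only the forms listed on the slide, and the function names are illustrative, not part of any OCR or tagging tool. It also shows why an OCR engine that drops the diacritic loses the distinction.

```python
import unicodedata


def nfc(text: str) -> str:
    """Normalize to composed form so 'è' always compares equal to 'è'."""
    return unicodedata.normalize("NFC", text)


# Readings copied from the slide; everything else here is an assumption.
MARKED = {nfc(k): v for k, v in {
    "altè": "adverb",
    "quàm": "adverb",
    "cùm": "conjunction",
    "hastâ": "ablative",
}.items()}

# Reading of the unmarked form in a text that uses the diacritics consistently.
UNMARKED = {
    "alte": "vocative",
    "quam": "pronoun",
    "cum": "preposition",
    "hasta": "nominative",
}


def strip_diacritics(word: str) -> str:
    """Remove combining marks, e.g. 'hastâ' -> 'hasta'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def reading(word: str):
    """Return the reading suggested by the diacritic, or None if unknown."""
    w = nfc(word.lower())
    if w in MARKED:
        return MARKED[w]
    return UNMARKED.get(w)


def diacritic_lost(truth: str, ocr: str) -> bool:
    """True if the OCR output differs from the truth only by diacritics."""
    t, o = nfc(truth), nfc(ocr)
    return t != o and strip_diacritics(t) == strip_diacritics(o)
```

For example, `reading("cùm")` gives the conjunction reading, while an OCR result `cum` for the same token would be looked up as a preposition; `diacritic_lost("cùm", "cum")` flags exactly this kind of error.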

Prospects: State of the art
  example pages from printings of 1544, 1649 and 1779 (page images not reproduced)

Prospects: State of the art
  results for example pages, character accuracy in %

  Year  Abbyy FR 11.1  Tesseract 3.03  OCRopus 0.7
  1544  83.14          70.32           74.59
  1649  88.07          84.87           78.98
  1779  82.13          80.77           75.46

  out-of-the-box performance, no language model (or default = English)
  OCRopus hampered by bad image-text segmentation
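The slides do not say how character accuracy was computed. A common definition, assumed here, is the share of ground-truth characters that survive after counting insertions, deletions and substitutions (Levenshtein distance). The sketch below implements that definition; the function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Character accuracy in percent, relative to the ground-truth length."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = levenshtein(ground_truth, ocr_output)
    return 100.0 * (len(ground_truth) - errors) / len(ground_truth)
```

With this definition, an engine that reads both long s characters in `ſenſus` as `f` makes two substitution errors in six characters.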

Prospects: Overcoming the obstacles
  Training (Tesseract, OCRopus)
    (a) generate pseudo-historical images from existing texts and historical-looking computer fonts (add some degradation to the image)
    (b) transcribe some real pages and train on true historical fonts
  Lexical resources (Tesseract) in recognition
  Post-processing
    correct OCR errors, not historical spelling (might be interesting itself)
    add annotation: expand abbreviations, ligatures, normalize spelling
    helpful: language model, lexicon of historical word forms
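The post-processing step above keeps the historical spelling and adds a normalized form as annotation. A minimal sketch of that idea follows, assuming a small hand-made character map (`NORMALIZATION_MAP` is illustrative and far from complete). Abbreviations are left out on purpose, since their expansion is ambiguous and needs a lexicon or language model.

```python
# Illustrative map only: long s and a few ligatures mentioned on the slides.
NORMALIZATION_MAP = {
    "ſ": "s",
    "Æ": "Ae",
    "æ": "ae",
    "Œ": "Oe",
    "œ": "oe",
    "ﬆ": "st",
}


def normalize_token(token: str) -> str:
    """Replace historical characters by their modern equivalents."""
    return "".join(NORMALIZATION_MAP.get(c, c) for c in token)


def annotate(line: str):
    """Pair every token with its normalized form; the original is kept."""
    return [(token, normalize_token(token)) for token in line.split()]
```

Calling `annotate("cælum eſt")` pairs each transcribed token with its normalized spelling, so the diplomatic transcription is never overwritten.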

Historical Lexicon: Lextractor Tool
  Historical spelling variation (here: i/j) can be recorded as lexical entries and used to distinguish correct historical spellings from true OCR errors.
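How a recorded variation pattern separates historical spellings from OCR errors can be sketched as follows. This is not how the Lextractor tool works internally; the toy lexicon, the single pattern (historical `j` for modern `i`) and all function names are assumptions made for illustration.

```python
from itertools import product


def candidates(token: str, historical: str, modern: str):
    """Yield every form obtained by replacing any subset of the
    occurrences of the historical character with the modern one."""
    positions = [i for i, c in enumerate(token) if c == historical]
    for choice in product((False, True), repeat=len(positions)):
        chars = list(token)
        for pos, swap in zip(positions, choice):
            if swap:
                chars[pos] = modern
        yield "".join(chars)


def classify(token: str, lexicon, patterns=(("j", "i"),)) -> str:
    """Label a token using a modern lexicon plus variation patterns."""
    if token in lexicon:
        return "modern spelling"
    for historical, modern in patterns:
        if any(c in lexicon for c in candidates(token, historical, modern)):
            return "historical variant"
    return "possible OCR error"


# Toy lexicon of modern word forms (assumption, for demonstration only).
LEXICON = {"alii", "eius", "iam"}
```

With this lexicon, `classify("alij", LEXICON)` is accepted as a historical variant of `alii`, whereas `classify("alit", LEXICON)` is flagged for post-correction.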

Postcorrection: open-source tool PoCoTo
  (see the paper by Vobl et al., presented by Christoph Ringlstetter)

Training on historical fonts (artificial images)
  Example: Pontanus, Progymnasmata Latinitatis (1589)

Training on fonts, ideal lexicon
  Example: Pontanus, Progymnasmata Latinitatis (1589), character accuracy in %

  Page  Abbyy FR 11.1  Tesseract 3.03  OCRopus 0.7  Tesseract (font)  Tesseract (font + lex.)  OCRopus (font)
  15    87.79          80.88           80.70        91.02             93.90                    92.55
  16    82.94          77.41           76.94        80.12             85.65                    80.47
  17    85.25          75.98           86.07        85.41             91.56                    93.93
  18    85.93          79.51           85.53        88.29             92.68                    89.67
  19    87.94          80.09           79.09        86.06             90.15                    87.83

  OCRopus: no language model!
  red (on the original slide): accuracy better than Abbyy
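The red highlighting of the slide is lost in this transcript, but it can be recomputed from the figures. The sketch below copies the table values and, per system, reports the mean accuracy and the number of pages on which it beats Abbyy. The mean is unweighted (pages are not weighted by their number of characters), which is an assumption of this sketch, not a figure from the slides.

```python
from statistics import mean

# Character accuracies in %, pages 15 to 19, copied from the table above.
ACCURACY = {
    "Abbyy FR 11.1":           [87.79, 82.94, 85.25, 85.93, 87.94],
    "Tesseract 3.03":          [80.88, 77.41, 75.98, 79.51, 80.09],
    "OCRopus 0.7":             [80.70, 76.94, 86.07, 85.53, 79.09],
    "Tesseract (font)":        [91.02, 80.12, 85.41, 88.29, 86.06],
    "Tesseract (font + lex.)": [93.90, 85.65, 91.56, 92.68, 90.15],
    "OCRopus (font)":          [92.55, 80.47, 93.93, 89.67, 87.83],
}
BASELINE = "Abbyy FR 11.1"


def summarize(table, baseline=BASELINE):
    """Map each system to (mean accuracy, pages better than the baseline)."""
    base = table[baseline]
    summary = {}
    for system, scores in table.items():
        wins = sum(s > b for s, b in zip(scores, base))
        summary[system] = (round(mean(scores), 2), wins)
    return summary


if __name__ == "__main__":
    for system, (avg, wins) in summarize(ACCURACY).items():
        print(f"{system:<25} mean {avg:6.2f}  better than Abbyy on {wins} of 5 pages")
```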

Training on historical fonts (real images)
  Example: Thanner, Petronij Arbitri Sathyra (1500), 16 pages

Training on historical fonts (real images)
  Example: Thanner, Petronij Arbitri Sathyra (1500), character accuracy in %

  Page  Tesseract 3.03  OCRopus 0.7  OCRopus (trained)
  13    41.59           44.59        93.15
  14    52.38           57.77        94.61
  15    53.09           62.38        95.17
  16    59.09           61.45        93.27

  pages 1-12: training set; pages 13-16: test set
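One way to read the effect of training is as a relative reduction of the character error rate, where the error rate is taken as 100 minus the accuracy. This measure is not used on the slides; the sketch below applies it to the test pages 13 to 16 as an illustration, with hypothetical names throughout.

```python
# (OCRopus 0.7, OCRopus trained) accuracies in % for the test pages above.
TEST_PAGES = {
    13: (44.59, 93.15),
    14: (57.77, 94.61),
    15: (62.38, 95.17),
    16: (61.45, 93.27),
}


def relative_error_reduction(acc_before: float, acc_after: float) -> float:
    """Fraction of character errors removed, with error = 100 - accuracy."""
    error_before = 100.0 - acc_before
    if error_before <= 0:
        raise ValueError("no errors to reduce")
    error_after = 100.0 - acc_after
    return (error_before - error_after) / error_before


if __name__ == "__main__":
    for page, (before, after) in TEST_PAGES.items():
        reduction = relative_error_reduction(before, after)
        print(f"page {page}: {reduction:.1%} of character errors removed")
```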

Summary
  very old printings are hard to OCR out of the box
  Tesseract and OCRopus can be trained to results above ABBYY
  applying lexica as well as font training helps a lot
  OCRopus can be trained to accuracies > 90%, but must at present be combined with good line segmentation in a preprocessing step
  postcorrection will do the rest

Thank you for your interest!