OCR of Historical Printings of Latin Texts

Problems, Prospects

Uwe Springmann (1), Dietmar Najock (2), Hermann Morgenroth (2), Helmut Schmid (1), Annette Gotscharek (1) and Florian Fink (1)

(1) CIS, Ludwig-Maximilians-Universität München
(2) Institute for Greek and Latin Languages and Literatures, Freie Universität Berlin

Overview
- Why Latin?
- Problems
- Prospects

Why Latin?
- huge heritage: largest body of historical literary sources
- Latin publications dominate print production until about 1750
- many titles have never been reprinted
- either key or barrier to the cultural heritage of the western world
- has been left out of the IMPACT project despite its importance

Problems: Some problems for OCR engines
- historical fonts
- long s (ſ)
- historical ligatures: Æ, æ, Œ, œ, ﬆ
- polytonic Greek words
- diacritics
- abbreviations
- historical spellings

Problems: Some problems for OCR engines (continued)
- historical typography and spelling are also a problem for early modern languages
- ambiguous abbreviations (especially in incunabula) mean that OCR will not immediately yield fully expanded, machine-readable text
- but discretionary diacritics are helpful in POS/morphology disambiguation:
    adverb/vocative:          altè/alte
    adverb/pronoun:           quàm/quam
    conjunction/preposition:  cùm/cum
    ablative/nominative:      hastâ/hasta
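
The disambiguating role of the diacritics can be illustrated with a small lookup. This is only a sketch built from the four pairs on the slide: the table `DIACRITIC_READINGS` and the function `reading` are hypothetical names, and it assumes that the marked form always carries the first reading of each pair and the unmarked form the second. Real printings are less consistent, so an unmarked form may still be ambiguous in practice.

```python
import unicodedata

# Hypothetical lookup table containing only the four pairs from the slide.
# Assumption: marked form = first reading, unmarked form = second reading.
DIACRITIC_READINGS = {
    "altè": "adverb",
    "alte": "vocative",
    "quàm": "adverb",
    "quam": "pronoun",
    "cùm": "conjunction",
    "cum": "preposition",
    "hastâ": "ablative",
    "hasta": "nominative",
}


def reading(token: str) -> str:
    """Return the reading suggested by the diacritic, or 'unknown'."""
    # NFC normalization makes 'u' + combining grave equal to the precomposed 'ù'.
    key = unicodedata.normalize("NFC", token).lower()
    return DIACRITIC_READINGS.get(key, "unknown")
```

An OCR engine that drops the accent (cùm recognized as cum) would therefore silently change the suggested reading, which is one reason to keep diacritics in the recognized text.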

Prospects: State of the art
[Images: example pages from printings of 1544, 1649 and 1779]

Prospects: State of the art
Results for example pages, character accuracy in %

Year   Abbyy FR 11.1   Tesseract 3.03   OCRopus 0.7
1544   83.14           70.32            74.59
1649   88.07           84.87            78.98
1779   82.13           80.77            75.46

- out-of-the-box performance, no language model (or default = English)
- OCRopus hampered by bad image-text segmentation
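
The slides do not say how character accuracy was computed. A common definition, assumed here, is the share of ground-truth characters left after subtracting the edit (Levenshtein) distance between ground truth and OCR output. The function names below are illustrative and do not come from any of the tools mentioned.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Character accuracy in percent (assumed definition, see lead-in)."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = edit_distance(ground_truth, ocr_output)
    return 100.0 * (len(ground_truth) - errors) / len(ground_truth)
```

For example, if the ground truth "quàm ſuauis" is recognized as "quam fuauis", the lost accent and the long s read as f count as two errors in eleven characters.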

Prospects: Overcoming the obstacles
- Training (Tesseract, OCRopus)
    (a) generate pseudo-historical images from existing texts and historical-looking computer fonts (add some degradation to the image)
    (b) transcribe some real pages and train on true historical fonts
- Lexical resources (Tesseract) in recognition
- Post-processing
    - correct OCR errors, not historical spelling (which may be of interest in itself)
    - add annotation: expand abbreviations, ligatures, normalize spelling
    - helpful: language model, lexicon of historical word forms
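
The annotation step in post-processing can be sketched as follows: the recognized (diplomatic) form is kept untouched and a normalized form is stored next to it. The mapping `EXPANSIONS` and the function `annotate` are hypothetical; the table covers only the long s and the ligatures named earlier in the slides, and abbreviation expansion is left out because it depends on context.

```python
# Hypothetical expansion table; a real project would need a fuller,
# period-specific list and context-aware handling of abbreviations.
EXPANSIONS = {
    "ſ": "s",
    "Æ": "Ae",   # assumption: title case; all-caps words would need "AE"
    "æ": "ae",
    "Œ": "Oe",
    "œ": "oe",
    "ﬆ": "st",
}


def annotate(token: str) -> dict:
    """Keep the recognized form and add a normalized reading as annotation."""
    normalized = "".join(EXPANSIONS.get(ch, ch) for ch in token)
    return {"diplomatic": token, "normalized": normalized}
```

Keeping both forms reflects the point made above: the historical spelling is not an error to be corrected, so normalization is added as a layer rather than written over the OCR result.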

Historical Lexicon: Lextractor Tool
Historical spelling variation (here: i/j) can be recorded as lexical entities and used to distinguish correct historical spellings from true OCR errors.
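
How such a lexicon separates historical spellings from OCR errors can be sketched with two small word lists. This is not the Lextractor interface; the entries, the names `MODERN_LEXICON` and `HISTORICAL_VARIANTS`, and the function `classify` are invented for illustration.

```python
# Hypothetical toy lexicon: modern forms and recorded historical i/j variants.
MODERN_LEXICON = {"iam", "eius", "iustus", "maior"}
HISTORICAL_VARIANTS = {   # historical form -> modern form
    "jam": "iam",
    "ejus": "eius",
    "justus": "iustus",
    "major": "maior",
}


def classify(token: str) -> str:
    """Label a token as 'modern', 'historical' or 'suspect' (possible OCR error)."""
    word = token.lower()
    if word in MODERN_LEXICON:
        return "modern"
    if word in HISTORICAL_VARIANTS:
        return "historical"
    return "suspect"
```

Only tokens labelled 'suspect' would be offered for correction; a form such as ejus is accepted as a valid historical spelling.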

Postcorrection: open-source tool PoCoTo
(see the paper by Vobl et al., presented by Christoph Ringlstetter)

Training on historical fonts (artificial images)
Example: Pontanus, Progymnasmata Latinitatis (1589)

Training on fonts, ideal lexicon
Example: Pontanus, Progymnasmata Latinitatis (1589), character accuracy in %

Page   Abbyy FR 11.1   Tesseract 3.03   OCRopus 0.7   Tesseract (font)   Tesseract (font + lex.)   OCRopus (font)
15     87.79           80.88            80.70         91.02*             93.90*                    92.55*
16     82.94           77.41            76.94         80.12              85.65*                    80.47
17     85.25           75.98            86.07*        85.41*             91.56*                    93.93*
18     85.93           79.51            85.53         88.29*             92.68*                    89.67*
19     87.94           80.09            79.09         86.06              90.15*                    87.83

- OCRopus: no language model!
- * accuracy better than Abbyy (shown in red on the slide)

Training on historical fonts (real images)
Example: Thanner, Petronij Arbitri Sathyra (1500), 16 pages

Training on historical fonts (real images)
Example: Thanner, Petronij Arbitri Sathyra (1500), character accuracy in %

Page   Tesseract 3.03   OCRopus 0.7   OCRopus (trained)
13     41.59            44.59         93.15
14     52.38            57.77         94.61
15     53.09            62.38         95.17
16     59.09            61.45         93.27

Pages 1-12: training set; pages 13-16: test set
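
To compare the engines on the held-out pages at a glance, the per-page figures can be averaged. The sketch below only restates the four test-set rows from the table; the names `TEST_PAGES` and `mean_accuracy` are illustrative, and an unweighted mean over pages is an assumption, since the pages may differ in length.

```python
from statistics import mean

# Character accuracy in % on the test pages (13-16), copied from the table.
TEST_PAGES = {
    13: {"Tesseract 3.03": 41.59, "OCRopus 0.7": 44.59, "OCRopus (trained)": 93.15},
    14: {"Tesseract 3.03": 52.38, "OCRopus 0.7": 57.77, "OCRopus (trained)": 94.61},
    15: {"Tesseract 3.03": 53.09, "OCRopus 0.7": 62.38, "OCRopus (trained)": 95.17},
    16: {"Tesseract 3.03": 59.09, "OCRopus 0.7": 61.45, "OCRopus (trained)": 93.27},
}


def mean_accuracy(engine: str) -> float:
    """Unweighted mean accuracy of one engine over the test pages."""
    return mean(row[engine] for row in TEST_PAGES.values())


if __name__ == "__main__":
    for engine in TEST_PAGES[13]:
        print(f"{engine}: {mean_accuracy(engine):.2f}")
```

Because pages 1-12 were used for training, only pages 13-16 enter the average; including training pages would overstate the trained model's accuracy.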

Summary
- very old printings are hard to OCR out of the box
- Tesseract and OCRopus can be trained to results above ABBYY
- applying lexica as well as font training helps a lot
- OCRopus can be trained to accuracies > 90%, but must at present be combined with good line segmentation in a preprocessing step
- postcorrection will do the rest

Thank you for your interest!