Text Type Classification for the Historical DTA Corpus

Susanne Haaf
Deutsches Textarchiv, BBAW Berlin
NeDiMAH-CLARIN-Workshop "Exploring Historical Sources with Language Technology: Results and Perspectives"

About the Project
Deutsches Textarchiv / German Text Archive (DTA)
- Duration: 2007-2014/15
- Goal: Provide the basis for a reference corpus for the development of the New High German language (17th to 19th century)

About the Project
- Ca. 1,500 texts from different disciplines and text types
- Automatic linguistic analysis (lemmatization, tokenization, POS tagging, orthographic normalization)
- DTA 'Base Format'
  - Guidelines for transcription close to the source
  - Structural XML annotation according to TEI P5
  - Guidelines for metadata entry
- Web-based quality assurance
- DTA Extensions
  - Integration of historical text data from other project contexts
  - Curation and collection of diverse text resources

The DTA Bibliography
- Selection of works for the DTA core corpus: fixed bibliography
- The bibliography was created with the help of BBAW members, i.e. experts for the (history of) different (scientific) disciplines
- Requirements for the selection:
  - Reflect the diversity of text types at different points in time
  - Represent works which were
    - important for the scientific field
    - or: widely recognised (i.e. of huge public influence)
    - or even: not very influential
- Genuinely lexicographic approach
- Phase 3: New selection of another 200 works
  - Filling gaps with regard to time and text type

Text Type Classification for the DTA
- Created in a data-driven way, i.e.:
  - A new book enters the DTA corpus
  - Is there an existing category that fits?
    - Yes? Assign the fitting existing category!
    - No? Create a new category!
- Based on the classification of the DWDS (Digital Dictionary of the German Language), which was continually extended
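The assign-or-create procedure above can be sketched in a few lines of Python. This is only an illustration: the function name `assign_category` and the sample labels are hypothetical, and the question "does an existing category fit?" is reduced here to a case-insensitive label comparison, whereas in the project it is presumably an editorial judgment.

```python
def assign_category(proposed_label, categories):
    """Return (category, created) for a new book.

    Assumption: "fits" is simplified to a case-insensitive label match.
    """
    wanted = proposed_label.strip()
    for existing in categories:
        if existing.casefold() == wanted.casefold():
            # An existing category fits: reuse it.
            return existing, False
    # Nothing fits: extend the classification with a new category.
    categories.append(wanted)
    return wanted, True


# Hypothetical starting inventory of categories.
categories = ["Drama", "Cookbooks"]
first = assign_category("cookbooks", categories)
second = assign_category("Gardening", categories)
```

The sketch also makes the limitation of a data-driven scheme visible: the category list only ever grows from what has been added, so it cannot show which text types are absent.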

Text Type Classification for the DTA
- 3 main (super-)categories
- 2 levels: super- and sub-categories

Text Type Classification for the DTA
- Fiction: Drama, Lyrics, Prose
  - Biography, Epistolary Novel, Travel Literature, Novels, Children's Books, ...
- Functional Literature:
  - Handbooks (Good Behaviour/Etiquette, Pedagogy, Gardening, ...)
  - Travel Books, Cookbooks, Newspapers, Devotional Literature
- Scientific Texts:
  - Science: Biology, Geography, Medicine, Chemistry, ...
  - Humanities: Literature, Linguistics, History, Musical Studies, ...
  - Social Sciences and Economics

Text Type Classification: What for?

1. Access based on Text Types
http://www.deutschestextarchiv.de
http://www.deutschestextarchiv.de/list/browse?genre=gebrauchsliteratur

2. Queries based on Text Types
Example: Which travel destinations are mentioned in functional literature?

3. Analyses based on Text Types
[Figure: one panel each for Fiction, Functional Literature, Science]

3. Analyses based on Text Types
Kid's Toy (GermaNet) within Functional Literature
Query: Kinderspielzeug gn-sub #has[textclassdwds, /Gebrauchsliteratur/]
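The `#has[textclassdwds, /Gebrauchsliteratur/]` part of the query appears to restrict hits to documents whose text class metadata matches a pattern. The following Python sketch illustrates that kind of metadata filter only; it is not the DTA query engine, and the record layout, the sample titles, and the helper `has_field` are assumptions made for the example.

```python
import re

# Hypothetical metadata records; field names and values are illustrative only.
corpus = [
    {"title": "Sample cookbook", "textclass": "Gebrauchsliteratur::Kochbuch"},
    {"title": "Sample novel", "textclass": "Belletristik::Roman"},
    {"title": "Sample travel guide", "textclass": "Gebrauchsliteratur::Reiseführer"},
]


def has_field(records, field, pattern):
    """Keep the records whose metadata field matches the regular expression."""
    rx = re.compile(pattern)
    return [r for r in records if rx.search(r.get(field, ""))]


functional = has_field(corpus, "textclass", r"Gebrauchsliteratur")
titles = [r["title"] for r in functional]
```

A search term would then be evaluated only against the documents that survive this filter.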

Problem statement
- The text classification was created in a data-driven way:
  - It only shows what we have
  - It gives no clues about what we do not have (i.e. text types important for a certain time which are not represented in the DTA corpus)
  - Hence it is difficult to evaluate the representativeness of the DTA corpus in this respect
- The DTA text classification is not mapped to established existing classifications
- There are only two layers, which leads to ambiguities, e.g. Funeral Sermons:
  - Functional Literature::Theology?
  - Functional Literature::FuneralSermon?
  - Functional Literature::SpecialOccasion?

Solution: Switch to an existing classification?
Example AAD: Classification of the Working Group on Old Prints of the major German libraries
- Certain text types we need are not represented (e.g. gardening), others we don't need (e.g. maps)
- Other text types are modeled incoherently
- In some cases it is too detailed for us
- In other cases it is not detailed enough
- Sometimes there are no descriptions at all, or the descriptions are not extensive enough
- Text types belong to different description levels (text type vs. knowledge area, ...)

AAD: Incoherences
#OfficialPrintedPublication (1)
#OfficialPrintedPublication (2)
SubC: Law
#law
SupC: OfficialPrintedPublication
#CollectionOfLaws
Legend: BS = use synonym; OB = supercategory; UB = subcategory

AAD: Different Description Levels
#Catechism: Type of text presentation
#Children's Book: Type of intended usage
#Church Song: Text type
#Rhetorics: Knowledge area
Legend: BS = use synonym; OB = supercategory; UB = subcategory

Solution: Revised DTA text type classification
- Redesign and extend the DTA text type classification based on different existing classifications
- Mapping from the one to the other:
  - DTA text types can be transferred semi-automatically to the new classification
  - (Digitized) works of text types still missing in the corpus can be found via library catalogues
- Sources:
  - AAD (http://aad.gbv.de/empfehlung/aad_gattung.pdf)
  - Wikisource (http://de.wikisource.org/wiki/wikisource:systematik)
  - DWDS
  - DTA
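One way to read "semi-automatically" is that old labels with exactly one target in the new classification are transferred by machine, while labels with no target or several candidates go to a human editor. The Python sketch below works under that assumption; the function `transfer`, the mapping table, and all labels in it are hypothetical and not taken from the project.

```python
def transfer(old_labels, mapping):
    """Split old labels into automatic transfers and cases needing review."""
    automatic, review = {}, []
    for label in old_labels:
        targets = mapping.get(label, [])
        if len(targets) == 1:
            # Unambiguous: transfer without manual intervention.
            automatic[label] = targets[0]
        else:
            # No target or several candidates: leave to an editor.
            review.append(label)
    return automatic, review


# Hypothetical mapping from old DTA labels to new classification terms.
mapping = {
    "Cookbooks": ["cookbook"],
    "Funeral Sermons": ["funeral print", "devotional literature"],
}
automatic, review = transfer(["Cookbooks", "Funeral Sermons", "Gardening"], mapping)
```

In this toy run the funeral sermons land in the review list, which mirrors the ambiguity named in the problem statement.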

Solution: Revised DTA text type classification
- Small set of supercategories
  - Non-fiction
    - Scientific Literature
    - Functional Literature
  - Fiction
- Detailed (but still manageable) set of subcategories
- Hierarchies are allowed but kept shallow
- Descriptions/Documentation

Revised DTA text type classification
Classification of text types (i.e. of the subcategories):
- Präsentationsform (i.e. type of text presentation)
  - Flyer, Funeral Print (Funeralschrift), Book of Prayer, Cookbook, Catalogue
- Sitz im Leben (i.e. life context in which texts are embedded)
  - Devotional Literature, Texts for/from women, Occasional texts
- Textsorte (i.e. text type)
  - Poem, Novel, Scientific Paper
- Wissensbereich (i.e. knowledge area covered by the text)
  - Theology, Chemistry, Math, Linguistics
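The four description levels can be modelled as a fixed set of values attached to each subcategory, so that terms from different levels are never confused. The Python sketch below is one possible representation, not the project's data model; the names `Level`, `TERMS`, and `group_by_level` are invented for the example, and only a few of the terms listed above are included.

```python
from collections import defaultdict
from enum import Enum


class Level(Enum):
    """The four description levels named on the slide."""
    PRESENTATION = "Präsentationsform"
    LIFE_CONTEXT = "Sitz im Leben"
    TEXT_TYPE = "Textsorte"
    KNOWLEDGE_AREA = "Wissensbereich"


# A few subcategories from the slide, each tagged with its level.
TERMS = {
    "Flyer": Level.PRESENTATION,
    "Cookbook": Level.PRESENTATION,
    "Devotional Literature": Level.LIFE_CONTEXT,
    "Poem": Level.TEXT_TYPE,
    "Novel": Level.TEXT_TYPE,
    "Theology": Level.KNOWLEDGE_AREA,
}


def group_by_level(terms):
    """Collect term names under their description level."""
    groups = defaultdict(list)
    for name, level in terms.items():
        groups[level].append(name)
    return dict(groups)


groups = group_by_level(TERMS)
```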

Term description (via eXist)

<term type="texttype" source="#aad" id="autobiography">
  <name>autobiography</name>
  <desc type="main">
    <p>Life memoirs; description of historical events by personal witnesses</p>
    <bibl>AAD</bibl>
  </desc>
  <desc type="alternative-1">[...]</desc>
  <subordinates/>
  <superordinates>
    <term id="#biography"/>
  </superordinates>
  <mapping>
    <term source="#dwds">autobiography</term>
  </mapping>
  [features, notes, ...]
</term>

Term description (via eXist)

<term type="texttype" source="#dta" id="flyer">
  <name>flyer</name>
  <desc type="main">
    <p>Easily produced brochure, produced for the purpose of agitation, information, or documentation</p>
    <bibl>cf. AAD</bibl>
  </desc>
  <desc type="alternative-1">[...]</desc>
  [...]
  <mapping>
    <term source="#aad">flyer</term>
    <term source="#aad">broadsheet</term>
  </mapping>
  [features, notes, ...]
</term>
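Term records of this shape can be read with any XML library. The sketch below uses Python's standard `xml.etree.ElementTree` on a cut-down copy of the flyer record; the elided parts marked `[...]` on the slide are simply left out, an empty `superordinates` element is added so the record exercises that path, and the function `read_term` and its output layout are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Cut-down, well-formed copy of the slide's example (elided parts omitted).
TERM_XML = """
<term type="texttype" source="#dta" id="flyer">
  <name>flyer</name>
  <desc type="main">
    <p>Easily produced brochure</p>
    <bibl>cf. AAD</bibl>
  </desc>
  <superordinates/>
  <mapping>
    <term source="#aad">flyer</term>
    <term source="#aad">broadsheet</term>
  </mapping>
</term>
"""


def read_term(xml_text):
    """Extract id, name, main description, parents and mappings of one term."""
    root = ET.fromstring(xml_text.strip())
    return {
        "id": root.get("id"),
        "name": root.findtext("name"),
        "description": root.findtext("desc[@type='main']/p"),
        "superordinates": [
            t.get("id").lstrip("#") for t in root.findall("superordinates/term")
        ],
        "mappings": [
            (t.get("source").lstrip("#"), t.text)
            for t in root.findall("mapping/term")
        ],
    }


term = read_term(TERM_XML)
```

The `mappings` list is what a transfer between classifications would consume: here the DTA term `flyer` points to two AAD terms.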

Thank you!
Contact: haaf@bbaw.de
Project Deutsches Textarchiv:
- www.deutschestextarchiv.de
- www.deutschestextarchiv.de/doku/basisformat
- www.deutschestextarchiv.de/dtaq
- www.deutschestextarchiv.de/dtae
Literature: www.deutschestextarchiv.de/doku/publikationen