Text Type Classification for the Historical DTA Corpus
Susanne Haaf
Deutsches Textarchiv, BBAW Berlin
NeDiMAH-CLARIN-Workshop
Exploring Historical Sources with Language Technology: Results and Perspectives

About the Project
Deutsches Textarchiv / German Text Archive (DTA)
- Funding:
- Partner:
- Duration: 2007-2014/15
- Goal: Provide the basis for a reference corpus for the development of the New High German language (17th to 19th century)

About the Project
- Ca. 1,500 texts from different disciplines and text types
- Automatic linguistic analysis (lemmatization, tokenization, POS-tagging, orthographic normalization)
- DTA 'Base Format'
  - Guidelines for transcription close to the source
  - Structural XML annotation according to TEI P5
  - Guidelines for metadata entry
- Web-based quality assurance
- DTA-Extensions
  - Integration of historical text data from other project contexts
  - Curation and collection of diverse text resources

The DTA Bibliography
- Selection of works for the DTA core corpus: fixed bibliography
- The bibliography was created with the help of BBAW members, i.e. experts for the (history of) different (scientific) disciplines
- Requirements for the selection:
  - reflect the diversity of text types at different points in time
  - represent works which were
    - important for the scientific field
    - or: widely recognised (i.e. of great public influence)
    - or even: not very influential
- Genuinely lexicographic approach
- Phase 3: new selection of another 200 works
  - Filling gaps with regard to time and to text type

Text Type Classification for the DTA
- Created in a data-driven way, i.e.:
  - New book in the DTA corpus: is there an existing category that fits?
    - Yes? Assign the fitting existing category!
    - No? Create a new category!
- Based on the classification of the DWDS (Digital Dictionary of the German Language), which was continually extended

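The data-driven procedure above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name `assign_text_type` is hypothetical, the sample categories are illustrative, and "a category fits" is reduced to an exact label match, whereas in practice that decision is an editorial judgment.

```python
from collections import defaultdict


def assign_text_type(classification, supercategory, subcategory):
    """Assign a book to a category, creating the category only if none fits.

    Returns the two-level label and a flag telling whether a new
    subcategory had to be created. (Hypothetical helper, not DTA code.)
    """
    subcategories = classification[supercategory]
    created = subcategory not in subcategories
    if created:
        # No existing category fits: extend the classification.
        subcategories.add(subcategory)
    return f"{supercategory}::{subcategory}", created


# Illustrative starting state: supercategory -> set of subcategories.
classification = defaultdict(set)
classification["Functional Literature"].update({"Cookbooks", "Travel Books"})

# A cookbook arrives: the category already exists and is simply assigned.
label_1, created_1 = assign_text_type(
    classification, "Functional Literature", "Cookbooks"
)

# A gardening handbook arrives: no fitting category, so one is created.
label_2, created_2 = assign_text_type(
    classification, "Functional Literature", "Gardening"
)
```

The sketch also makes the limitation of the approach visible: the classification can only ever contain categories for books that have actually been added.
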
Text Type Classification for the DTA
- 3 main (super-)categories
- 2 levels: super- and sub-categories

Text Type Classification for the DTA
- Fiction: Drama, Poetry, Prose, Biography, Epistolary Novel, Travel Literature, Novels, Children's Books, ...
- Functional Literature: Handbooks (Good Behaviour/Etiquette, Pedagogy, Gardening, ...), Travel Books, Cookbooks, Newspapers, Devotional Literature
- Scientific Texts:
  - Science: Biology, Geography, Medicine, Chemistry, ...
  - Humanities: Literature, Linguistics, History, Musical Studies, ...
  - Social Sciences and Economics

Text Type Classification: What for?

1. Access based on Text Types
- http://www.deutschestextarchiv.de
- http://www.deutschestextarchiv.de/list/browse?genre=gebrauchsliteratur

2. Queries based on Text Types
- Travel destinations mentioned in functional literature?

3. Analyses based on Text Types
[Figure panels: Fiction, Functional Literature, Science]

3. Analyses based on Text Types
- Children's toys (GermaNet) within Functional Literature
- Query: Kinderspielzeug gn-sub #has[textclassdwds, /Gebrauchsliteratur/]

Problem statement
- The text classification was created in a data-driven way:
  - It only shows what we have, but it gives no clues about what we do not have (i.e. text types important for a certain time which are not represented in the DTA corpus)
  - Hence it is difficult to evaluate the representativeness of the DTA corpus in this respect
- The DTA text classification is not mapped to established existing classifications
- There are only two layers, which leads to ambiguities, e.g. Funeral Sermons:
  - Functional Literature::Theology?
  - Functional Literature::FuneralSermon?
  - Functional Literature::SpecialOccasion?

Solution: Switch to an existing classification?
Example AAD: classification of the Working Group on Old Prints of the major German libraries
- Certain text types we need are not represented (e.g. gardening); others we don't need (e.g. maps)
- Other text types are inconsistently modeled
- In some cases it is too detailed for us
- In other cases it is not detailed enough
- Sometimes there are no descriptions at all, or descriptions which are not extensive enough
- Text types belong to different description levels (text type vs. knowledge area ...)

AAD: Inconsistencies
- #OfficialPrintedPublication (1)
- #OfficialPrintedPublication (2), SubC: Law
- #law, SupC: OfficialPrintedPublication
- #CollectionOfLaws
Legend: BS = use synonym; OB = supercategory; UB = subcategory

AAD: Different Description Levels
- #Catechism: type of text presentation
- #Children's Book: type of intended usage
- #Church Song: text type
- #Rhetorics: knowledge area
Legend: BS = use synonym; OB = supercategory; UB = subcategory

Solution: Revised DTA text type classification
- Redesign and extend the DTA text type classification based on different existing classifications
- Mapping from the one to the other:
  - DTA text types can be transferred semi-automatically to the new classification
  - (Digitized) works of text types still missing in the corpus can be found in library catalogues
- Sources:
  - AAD (http://aad.gbv.de/empfehlung/aad_gattung.pdf)
  - Wikisource (http://de.wikisource.org/wiki/wikisource:systematik)
  - DWDS
  - DTA

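A semi-automatic transfer of this kind could look roughly like the following Python sketch. The function name `transfer_text_types`, the mapping table, and the target labels are hypothetical illustrations rather than the project's actual data. The assumption is that a label with exactly one mapped target is transferred automatically, while unmapped or ambiguous labels are set aside for manual review.

```python
def transfer_text_types(old_labels, mapping):
    """Split old labels into automatically transferred ones and review cases.

    `mapping` maps an old label to a list of candidate labels in the
    revised classification. (Hypothetical helper, not DTA code.)
    """
    transferred = {}
    needs_review = []
    for label in old_labels:
        targets = mapping.get(label, [])
        if len(targets) == 1:
            # Unambiguous mapping: transfer without human intervention.
            transferred[label] = targets[0]
        else:
            # Missing or ambiguous mapping: leave the decision to an editor.
            needs_review.append(label)
    return transferred, needs_review


# Illustrative mapping table (assumed values).
mapping = {
    "Functional Literature::Cookbooks": ["cookbook"],
    "Functional Literature::FuneralSermon": ["funeral print", "occasional text"],
}

old_labels = [
    "Functional Literature::Cookbooks",
    "Functional Literature::FuneralSermon",
    "Functional Literature::Gardening",
]

transferred, needs_review = transfer_text_types(old_labels, mapping)
```

Keeping the review cases in a separate list is what makes the process "semi"-automatic: only the unambiguous part is handled without an editor.
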
Solution: Revised DTA text type classification
- Small set of supercategories
  - Non-fiction
    - Scientific Literature
    - Functional Literature
  - Fiction
- Detailed (but still manageable) set of subcategories
- Hierarchies are allowed but kept shallow
- Descriptions/Documentation

Revised DTA text type classification
Classification of text types (i.e. of the subcategories):
- Präsentationsform (i.e. type of text presentation): Flyer, Funeral Print (Funeralschrift), Book of Prayer, Cookbook, Catalogue
- Sitz im Leben (i.e. life context in which texts are embedded): Devotional Literature, Texts for/from women, Occasional texts
- Textsorte (i.e. text type): Poem, Novel, Scientific Paper
- Wissensbereich (i.e. knowledge area covered by the text): Theology, Chemistry, Math, Linguistics

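One way to record which of the four description levels a subcategory belongs to is a small lookup table. The Python sketch below is an assumption about how this could be modeled, not the project's implementation. The names `Facet`, `SUBCATEGORY_FACETS`, and `subcategories_of` are hypothetical, and only a few of the examples listed above are included.

```python
from enum import Enum


class Facet(Enum):
    """The four description levels used to classify subcategories."""
    PRESENTATION_FORM = "Präsentationsform"
    LIFE_CONTEXT = "Sitz im Leben"
    TEXT_TYPE = "Textsorte"
    KNOWLEDGE_AREA = "Wissensbereich"


# Illustrative assignments taken from the examples above.
SUBCATEGORY_FACETS = {
    "Flyer": Facet.PRESENTATION_FORM,
    "Cookbook": Facet.PRESENTATION_FORM,
    "Devotional Literature": Facet.LIFE_CONTEXT,
    "Occasional texts": Facet.LIFE_CONTEXT,
    "Poem": Facet.TEXT_TYPE,
    "Novel": Facet.TEXT_TYPE,
    "Theology": Facet.KNOWLEDGE_AREA,
    "Chemistry": Facet.KNOWLEDGE_AREA,
}


def subcategories_of(facet):
    """Return the subcategories of one description level, sorted by name."""
    return sorted(
        name for name, assigned in SUBCATEGORY_FACETS.items() if assigned is facet
    )


text_types = subcategories_of(Facet.TEXT_TYPE)
```

Making the description level explicit per subcategory avoids mixing, for example, a knowledge area and a type of text presentation at the same level without marking the difference.
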
Term description (via eXist)
<term type="texttype" source="#aad" id="autobiography">
  <name>autobiography</name>
  <desc type="main">
    <p>life memories; Description of historical events by personal witnesses</p>
    <bibl>aad</bibl>
  </desc>
  <desc type="alternative-1">[...]</desc>
  <subordinates/>
  <superordinates>
    <term id="#biography"/>
  </superordinates>
  <mapping>
    <term source="#dwds">autobiography</term>
  </mapping>
  [features, notes, ...]
</term>

Term description (via eXist)
<term type="texttype" source="#dta" id="flyer">
  <name>flyer</name>
  <desc type="main">
    <p>easily produced brochure, produced for the purpose of agitation, information, or documentation</p>
    <bibl>cf. AAD</bibl>
  </desc>
  <desc type="alternative-1">[...]</desc>
  [...]
  <mapping>
    <term source="#aad">flyer</term>
    <term source="#aad">broadsheet</term>
  </mapping>
  [features, notes, ...]
</term>

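Term records of this shape can be read with Python's standard `xml.etree.ElementTree` module, for example to collect the mappings to other classifications. The following is a minimal sketch: the function name `read_mappings` is hypothetical, and the record is a shortened copy of the flyer example with the elided parts left out.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the flyer record; elided parts of the slide are omitted.
TERM_XML = """
<term type="texttype" source="#dta" id="flyer">
  <name>flyer</name>
  <desc type="main">
    <p>easily produced brochure, produced for the purpose of agitation, information, or documentation</p>
    <bibl>cf. AAD</bibl>
  </desc>
  <mapping>
    <term source="#aad">flyer</term>
    <term source="#aad">broadsheet</term>
  </mapping>
</term>
"""


def read_mappings(term_xml):
    """Return the term id and its (source, label) mappings."""
    term = ET.fromstring(term_xml)
    mappings = [
        (mapped.get("source"), mapped.text)
        for mapped in term.findall("./mapping/term")
    ]
    return term.get("id"), mappings


term_id, mappings = read_mappings(TERM_XML)
```

Here one DTA term maps to two AAD terms, which is the kind of one-to-many relation that would need to be resolved when transferring categories between classifications.
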
Thank you!
Contact: haaf@bbaw.de
Project Deutsches Textarchiv:
- www.deutschestextarchiv.de
- www.deutschestextarchiv.de/doku/basisformat
- www.deutschestextarchiv.de/dtaq
- www.deutschestextarchiv.de/dtae
Literature: www.deutschestextarchiv.de/doku/publikationen