Digital Editions for Corpus Linguistics

Digital Editions for Corpus Linguistics: A new approach to creating editions of historical manuscripts
  Alpo Honkapohja, Samuli Kaislaniemi, Ville Marttila
  University of Helsinki
  Digital Humanities conference, Oulu, 25-29 June 2008
  www.helsinki.fi/varieng/domains/decl.html
-Apologies for Alpo & Ville not making it to DH 2008
-Properly, our project could be called: creating digital editions of historical manuscripts which are suitable for corpus linguistic enquiry
-The idea is for linguistically oriented online digital editions of historical manuscripts
-It's worth emphasising that these editions are not only meant for corpus linguistics!

Outline of this talk
  Background: historical corpus linguistics
  DECL: theoretical principles
  Current status and the first DECL editions
  A look at a DECL edition
-Before describing the project as it stands now, this presentation will spend some time on the rationale behind the project, to give a better picture of what DECL is after

Background 1: Historical corpus linguistics
  Research Unit for Variation, Contacts and Change in English (VARIENG), University of Helsinki
    Aim: to study variation and long-term change in English with the help of electronic corpora
  The Helsinki Corpus of English Texts (1991)
    multi-genre corpus of English
    c. 450 texts (c. 1.6m words)
    spans the years 730-1710
    available from OTA (the Oxford Text Archive)
-This is where DECL springs from

Background 1: Historical corpus linguistics cont'd
  The Corpus of Early English Correspondence (CEEC; in progress)
    currently c. 12,000 personal letters (c. 5.1m words)
    spans the years 1403-1800
    published subcorpora available from OTA
  The Corpus of Early English Medical Writing (CEEM; in progress)
    c. 3.75m words (estimate)
    spans the years c. 1375-1800
    published subcorpus available from Benjamins (CD-ROM)
-We've been involved with these projects, helping compile corpora: Sam with CEEC, Alpo and Ville with CEEM
-These are 2nd gen. corpora, designed to answer specific research questions
-CEEC is created for historical sociolinguistics: the influence of social factors on linguistic use
-CEEM for the scientific thought-styles project: stylistic changes in medical texts
-Both corpora are based on historical manuscripts - in edited, printed form (for the most part)

Background 2: Conventional historical corpora
  Based on printed editions of historical texts
    + Compilation easier and faster than from manuscript sources
  Drawbacks:
    Link with manuscript originals lost
    Editorial principles vary
    Orthography unreliable
    Manuscript features rarely marked
    Copyright issues
    Duplication of effort and errors
-Historical corpora have usually been compiled from a group of sources
-The creation of historical corpora from printed editions has been called "philological outsourcing" (Dollinger, Stefan. 2004. Philological computing vs. philological outsourcing and the compilation of historical corpora: A Late Modern English test case. Vienna English Working Papers (VIEWS) 13 (2), 3-23.)
-Contextual information is also usually not contained in the corpus or in the manual of the corpus (should one exist)

Background 3: Historical corpora based on manuscripts
  Middle English Grammar Corpus (MEG-C)
  Salem Witch Trials Corpus
  Corpus of Scottish Correspondence (CSC)
  A Corpus of Late Eighteenth-Century Prose
-MEG-C & Salem were created from edited material, but checked against the original manuscripts
-CSC & L18C were created from mss
-MEG-C: c. 3m words
-Salem: 1,000 records
-CSC: 256,593
-L18C: 300,000 words
-Some problems:
  -CSC and L18C are not editions, which is a shame
  -MEG-C & Salem are great multidisciplinary projects, but like CSC & L18C, they don't use XML, which arguably would enable easy conversion to other formats

Background 4: Digital editions
  Could be more versatile and user-friendly overall
    e.g. restricted text searches, hard to manipulate texts
  Have yet to become a norm for publishing historical texts
  Labour-intensive
-Compared to corpora, editions, on the other hand, are not usually suited for linguistic enquiry - which is also a shame
-Arguably, though, they are becoming a norm, albeit sloooowly
-Lack of publishing infrastructure and user-friendly tools limits individual scholars and small-scale projects
-Cf. use of online text databases for non-intended purposes: e.g. historian John Styles using the Old Bailey Online to study 18th-century interiors
-...In any case, it is to bring the requirements of users of digital editions and linguistic corpora closer together that we have initiated...

The Digital Editions for Corpus Linguistics project (DECL)
  Aims of the project:
  1. To create editions that function as both editions and corpora, allowing equally:
    comparison of manuscript image and diplomatic transcript
    textual searches of transcripts and linguistic tags
  2. To create a framework which makes the creation of such editions easy and is readily adaptable to different types of historical texts
-DECL was started in January 2008 by Alpo, Ville and Sam
-Comparison of image and transcript is a typical feature of editions
-Sophisticated text searches, on the other hand, are a requisite of corpus functionality
-Editing historical manuscripts is not easy, and electronic editing even less so, with technical features to learn
-Further, editions usually fall short of expectations
-DECL hopes to help!

The Digital Editions for Corpus Linguistics project (DECL)
  Features:
    Historical manuscripts
    Meant for individual scholars and small-scale research projects
    Using existing standards and tools
    Open Source principle
-DECL is intended to feed the need for further edited historical material
-Our focus is not on literary material, although we do not exclude it
-DECL is not intended for large-scale, well-funded projects; instead, our concern is for scholars without infinite time and money
-For instance, PhD corpora tend to be compiled, then forgotten/lost
-DECL aims to help the creation of versatile digital editions which are compatible with other similar resources
-We do not want to reinvent the wheel - instead, bolster existing (widely approved) standards and good tools
  -e.g. TEI
-We also embrace free software and resources, and will publish DECL editions under a suitable Creative Commons license

The Digital Editions for Corpus Linguistics project (DECL)
  Theoretical principles

Theoretical background
  Retaining manuscript reality
  The focus must be equally on artefact, text and context
    Artefact = the physical manuscript
    Text = the linguistic contents of the artefact
    Context = the cultural circumstances relating to the text and the artefact
-The challenge is to represent manuscript reality in a form that suits the needs of historians as well as corpus linguists
  -This includes fields such as historical pragmatics, sociolinguistics and discourse analysis
  -These all need evidence of language production and the social context of the texts
-Note: instead of "document", used at this point in our abstract, we now feel that "artefact" lessens chances of confusion
-We're still defining and refining terms and concepts, and these are subject to change
-All of these levels are considered equally important facets of the manuscript reality and should be represented in the edition
-We aim at "documentary editing"; our primary concern is to retain authenticity: the link to the original manuscripts

Theoretical background cont'd
  Roger Lass (2004): A historical corpus should:
    preserve the text as accurately and faithfully as possible
    convey it in as flexible a form as possible
    ensure that any editorial intervention remains visible and reversible
-Our starting point is Lass's 2004 article (Lass, Roger. 2004. Ut custodiant litteras: Editions, Corpora and Witnesshood. In Methods and Data in English Historical Dialectology, ed. by Marina Dossena and Roger Lass. Bern: Peter Lang, 21-48. [Linguistic Insights 16]), where he argued that historical corpora compiled from editions fail to represent linguistic reality
-Others have voiced the same concerns (for references, see handout)
-However, DECL does not follow Lass's model in transcribing or encoding texts
-DECL is concerned with creating editions from historical manuscripts, and not corpora

Three guiding principles: flexibility, transparency and expandability
-Based on Lass (2004), we have derived some guiding theoretical principles

Three guiding principles: 1. Flexibility
  XML
    Based on and compatible with TEI P5 guidelines
    Easily convertible into various database and document formats
    Text and metadata present in tagging, and combinable into new documents (e.g. subcorpora)
  Layered structure with customisable online interface
    Allows viewing, searching and downloading relevant aspects of the edition, while leaving others out
  Platform-independent solutions based on the Open Source principle
    Tools, texts and tagging can be freely downloaded and modified by users

Three guiding principles: 2. Transparency
  All editorial intervention indicated by markup
    Explicitly distinguished from the unemended transcription
    Reversible
  All layers of the edition accessible
    Manuscript images, raw transcript and annotation
    Allows users to (re-)evaluate editorial decisions
  Detailed documentation
-Ambiguities are marked as such
-Transcription principles, editorial policies and encoding practices will be thoroughly documented
-DECL guidelines will be a subset of the TEI P5 guidelines
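
What "visible and reversible" intervention could look like in practice can be sketched with the TEI P5 elements <choice>, <abbr>/<expan> and <sic>/<corr>, which record the manuscript form and the editor's reading side by side. The sample line, and the assumption that DECL would encode expansions and emendations this way, are illustrative only; the talk itself says the guidelines are still being written.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample line. <choice>, <abbr>, <expan>, <sic> and <corr> are
# TEI P5 elements; the text and this particular use of them are assumptions.
SAMPLE = (
    "<l>Take <choice><abbr>ye</abbr><expan>the</expan></choice> "
    "<choice><sic>rotes</sic><corr>rootes</corr></choice> of fenel</l>"
)

def render(elem, keep):
    """Return the text of elem, keeping from each <choice> only the
    alternatives whose tag is listed in keep."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "choice":
            for alt in child:
                if alt.tag in keep:
                    parts.append(render(alt, keep))
        else:
            parts.append(render(child, keep))
        parts.append(child.tail or "")
    return "".join(parts)

line = ET.fromstring(SAMPLE)
diplomatic = render(line, {"abbr", "sic"})   # what the manuscript has
edited = render(line, {"expan", "corr"})     # the editor's reading
print(diplomatic)
print(edited)
```

Because both readings stay in the file, the user can switch between them or reject the editor's choice, and nothing of the unemended transcription is lost.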

Three guiding principles: 3. Expandability
  Uniform editorial and encoding practices
    Ensure comparability of DECL editions
    Allow editions to be combined into corpora
  Modular architecture
    Allows new documents to be added to editions
  Layered structure
    Allows new layers of annotation to be added to an edition
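
A rough sketch of why uniform practices matter for combining editions: if every text carries the same metadata fields, a subcorpus is just a filter over several editions. The edition names, text IDs, field names and sample values below are all hypothetical and do not describe actual DECL data.

```python
# All edition names, IDs and metadata values are hypothetical placeholders.
EDITIONS = {
    "letters": [
        {"id": "let-001", "year": 1603, "lang": "en"},
        {"id": "let-002", "year": 1609, "lang": "en"},
    ],
    "recipes": [
        {"id": "rec-001", "year": 1450, "lang": "en"},
        {"id": "rec-002", "year": 1450, "lang": "la"},
    ],
}

def subcorpus(editions, keep):
    """Collect (edition, text id) pairs for every text whose metadata
    passes the keep() predicate."""
    return [
        (name, text["id"])
        for name, texts in editions.items()
        for text in texts
        if keep(text)
    ]

# Two example selections across both editions.
english = subcorpus(EDITIONS, lambda t: t["lang"] == "en")
early_english = subcorpus(EDITIONS, lambda t: t["lang"] == "en" and t["year"] < 1500)
print(english)
print(early_english)
```

The filter only works because both editions use the same field names and value conventions, which is the point of shared guidelines.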

DECL editions underway
  Alpo Honkapohja: A Digital Edition of MS O.1.77 Trinity College Cambridge
    late Middle English pocket-sized medical handbook
    produced ca. 1460 in London or Westminster
    c. 15 texts, in English and Latin
    will be the first bilingual digital edition of early scientific writing in England
  Samuli Kaislaniemi: The Life and Early Letters of Richard Cocks, English Merchant (1600-1610)
    Early Modern English intelligence letters
    written between 1600 and 1610 from Bayonne in France
    c. 125 letters, with some abstracts and other documents
  Ville Marttila: Potage Dyvers: A digital edition of a family of late medieval culinary recipe collections
    six closely related Middle English culinary recipe collections
    dating from the 15th century
    over 200 recipes
-These are all PhD theses, or material compiled for our PhD theses
-We would welcome other projects, even at this early stage: more text types will lead to better general DECL guidelines
-The first edition (Sam's) should be done in 2010; all three within 4-5 years
-The DECL framework should be available 2010/2011 (v.1.0); a finished version c. 2012

Current state of the project
  Working on:
    Transcriptions of manuscripts
    DECL guidelines (from TEI P5)
    PR
    Looking for funding
-Transcriptions: Sam has finished round one; Alpo & Ville should finish round one by the end of 2008
-Naturally the texts will need another two rounds of proofreading later on

Things on the To Do list
  Repository for DECL editions
    University of Helsinki (CARHU, Campus Repository of Humanities and Social Sciences)?
  User interface
    Online editions, so work with web browsers
  ...things we haven't even thought of yet: please tell us!
-The University of Helsinki has recently been very active in developing the sustainability of electronic resources created by University members
-The University, together with the Finnish National Archives and the National Library, is also actively involved in pan-European developments in the field
-For the user interface - as with nearly everything - we will use existing solutions
-It's too early to say anything concrete

Copyright issues
  Texts
    Transcriber holds copyright
  Images
    Archives generally lenient (?)
    Libraries tend to demand money
  Software / tools / solutions
    Open Source, free software, Creative Commons licenses
-We have given initial thought to copyright issues
-The DECL framework and editions will be (inshallah) released under a suitable Creative Commons license
-This can be seen as passing the issue of sustainability to the public domain...
-Yet using existing and emerging standards will help strengthen them

A look at a DECL edition
  It's only a model
-Ok, let's have a look at a DECL edition.
-...Having said that, this is very much a mock-up, and technical details are all work-in-progress

[Figure: diagram of the modular and layered structure of a DECL edition, divided into Manuscript Editing and Linguistic Analysis, with numbered callouts 1-3. Layers named in the diagram:]
  Original manuscript
  Manuscript images: high-resolution digital images
  Transcription: diplomatic, graphemic, unemended, original punctuation & word division
  Manuscript description: catalogue information, documenting hands & abbreviations
  Manuscript structure: pagination, lineation, layout
  Textual structure: parts, chapters, paragraphs
  Editorial notes: textual & explanatory notes
  Manuscript features: hands, abbreviation, decoration, emendation, annotation
  Textual content: text tokenised into uniquely identified word-units
  Links: parallel versions, intertextuality, related texts, glossaries
  Normalised textual content: spelling variants normalised to standard forms
  POS tagging, lemmatisation, syntactic parsing
  Semantic annotation, pragmatic annotation, discourse annotation
-Notes to the callouts:
  1. Normalisation is part of creating the edition (spelling variation)
  2. Thus, it becomes easy to use automated tools for linguistic annotation
  3. And all this without losing the link to the mss sources

Technical principles
  TEI-compliant XML (P5)
    Text-type specific features as additional modules to the TEI schema
  Modular structure
    Layers of annotation
    Standoff markup
-The XML-based encoding will
  -allow the contents of the editions to be used with any XML-aware tools
  -allow easy conversion to other document or database standards
  -allow easy addition of further layers of annotation

Demo 1: Manuscript facsimile
-A linguistically oriented edition needs to be based on the original manuscript:
  -Images, preferably of better quality than the microfilm reproduction shown here, should be obtained
  -They serve both as an aid to preliminary transcription and as a component of the finished edition
  -This will ensure complete transparency, as the user can verify any editorial readings against the image of the original
-The process begins with producing a raw transcription of the manuscript text

Demo 2: Raw transcript
-The raw transcription contains all of the textual features (abbreviations, superscripts, additions & deletions, etc.) that are to be included in the edition
-Using shorthand notation (representative symbols) at this stage speeds up keying and proofing
  -Notation should be unambiguous and as far as possible automatically convertible to the final XML
  -In this example abbreviations have been indicated by various special characters and their expansion given in parentheses
  -Additions by a later hand have been indicated by boldface
  -Underlinings and strikethroughs are indicated by corresponding formatting
-Features that cannot be automatically processed should be indicated clearly so that they can be searched for and marked up manually
-The raw transcript made from a manuscript image should then be proofed against the manuscript, noting down features not apparent in the image

Demo 3: XML encoding
-After the raw transcription has been checked, it will be converted into XML following the DECL guidelines
  -Explicit tagging for lineation, paragraphs and other structural features is added
-Some textual features can be replaced automatically using search and replace
  -Abbreviations, emendations and other features with formal markup
  -The guidelines could even describe a recommended shorthand, for which macros could be provided to automate conversions further
  -Special characters (yogh, ampersand, etc.) replaced by entities or elements (still not decided)
-More complex textual features will be marked up by hand
  -Features not expressible in formal markup will be described in textual notes linked to the appropriate locus
-When the textual and visual features of the manuscript originals have been encoded in XML according to the guidelines, the finished XML document can be tokenised and separated into standoff format using a series of XSL Transformations
  -Content, structure and various annotation layers are separated either into individual documents or into separate sections within a single document
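
The search-and-replace step can be sketched as follows. The shorthand is invented for this example and is not the DECL notation: an abbreviated word is immediately followed by its expansion in parentheses, and " | " marks a manuscript line break. The target elements <choice>, <abbr>, <expan> and <lb/> are TEI P5, but choosing them here is an assumption.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical shorthand: abbreviation followed by "(expansion)".
ABBR = re.compile(r"(\S+?)\(([^()\s]+)\)")

def shorthand_to_xml(raw):
    """Convert one raw-transcript line in the hypothetical shorthand
    into an XML fragment."""
    text = escape(raw)  # escape &, < and > before adding any markup
    text = ABBR.sub(r"<choice><abbr>\1</abbr><expan>\2</expan></choice>", text)
    return text.replace(" | ", "<lb/>")

raw_line = "Take w^t(with) sugar | & boyle"
xml_fragment = shorthand_to_xml(raw_line)

# Sanity check: wrapped in a paragraph, the result must parse as XML.
paragraph = ET.fromstring(f"<p>{xml_fragment}</p>")
print(xml_fragment)
```

Anything the pattern cannot handle, such as nested parentheses, is left untouched, so it can be found and marked up by hand as the notes suggest.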

Demo 4: Stand-off markup
-Once the textual content has been separated from the markup and tokenised into explicitly identified word units, a normalised version will be created
  -A semi-automatic process using a tool developed for the purpose
  -A high priority for the project
  -Allows searches to be made using normalised spellings and results returned with original spellings
  -Also facilitates the application of many automatic annotation tools
-New layers of linguistic analysis can be created using the normalised spellings
  -These automatically become linked with the underlying original spellings and the encoded manuscript features
-From the standoff documents, new documents containing only the desired annotation layers can be created on the fly
-The material can also be converted into database form if needed
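
A minimal sketch of the standoff idea, assuming whitespace tokenisation and a hand-made normalisation table (both are simplifications; the actual DECL tool is described only as semi-automatic). Each word unit gets a unique ID, the normalised spellings live in a separate layer keyed by those IDs, and a search on the normalised form returns the original spellings.

```python
def tokenise(text, prefix="w"):
    """Split on whitespace and give each word unit a unique ID."""
    return {f"{prefix}{i}": word for i, word in enumerate(text.split(), start=1)}

# Base layer: original spellings, keyed by token ID (sample text is invented).
tokens = tokenise("Take the rootes of fenel and the rotes of persely")

# Hypothetical normalisation table, standing in for the semi-automatic tool.
NORMALISED = {"rootes": "roots", "rotes": "roots",
              "fenel": "fennel", "persely": "parsley"}

# Standoff layer: refers to the base layer by ID only, never copies its text.
norm_layer = {tid: NORMALISED.get(word.lower(), word.lower())
              for tid, word in tokens.items()}

def search(normalised_form):
    """Find tokens by normalised spelling; return IDs with original spellings."""
    return [(tid, tokens[tid])
            for tid, norm in norm_layer.items()
            if norm == normalised_form]

print(search("roots"))
```

Further layers (POS tags, lemmas) could be added as more dictionaries keyed by the same IDs, which keeps every annotation tied to the original spelling.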

Demo 5: Interface mock-up
-A browser-based online interface will be provided for end-users to access DECL editions
  -Customisable interface: the user can not only navigate the text but also define what aspects of it are displayed and how they will be presented
-All texts will also be downloadable, both in the original XML and various other formats
  -With selected annotation layers
-A full-fledged search engine and corpus interface will also be presented
  -Possibly based on the CQP-edition of BNCWeb
  -Will enable the combination of different DECL editions into corpora
  -Powerful search facilities and the ability to analyse, manipulate and download search results

To summarize
  Things to remember about DECL:
    Small-scale projects
    Retaining manuscript reality
    Using existing tools and standards
    Usability, accessibility, transparency
    Remember the linguist
  "Very simple, but desperately needed"
-The main assets of DECL editions are keeping the editorial chain intact, and working from documented editorial principles, so that manuscript reality gets carried into the text and code of the editions. It is easier for editors of the manuscripts to tag manuscript features and normalise the text than it is for corpus compilers working from edited material.
-Increased workload for the editor, but greatly increased benefits too
-DECL does not compile corpora! We help the process, but DECL aims at making digital editions of historical manuscripts
-Multi-purpose resources are what we want to have
-The quote is from Eero Hyvönen in his opening plenary, describing a software solution his team had created. The same can be said about DECL.

What we hope to achieve
  Increase the usefulness of digital editions
  Versatile digital resources
  Better manuscript-based historical corpora
  Better tools and standards
  Increase interdisciplinary cooperation

Thank You!
  www.helsinki.fi/varieng/domains/decl.html
  Samuli.Kaislaniemi@Helsinki.Fi