Climbing the Tower of Babel: Challenges and Opportunities in Multilingual Data for the Digital Humanities

Climbing the Tower of Babel: Challenges and Opportunities in Multilingual Data for the Digital Humanities. 7th LIDER Roadmapping Workshop: Linked Data for Digital Humanities and Linguistics, 20 October 2015, Madrid. Clemens Neudecker, Staatsbibliothek zu Berlin, @cneudecker

La búsqueda de la lengua perfecta Umberto Eco, 1994 I certainly will never advise to follow the bizarre thought presented here and dream of a universal language

How many languages are there? The Holy Bible, Genesis 10: 72 (70). Max Planck Institute for Evolutionary Anthropology: 6500-7000. ISO 639-3: 7704 (ISO 639-2: 450). Google Translate supported: 90. Europeana content: currently 50.

Metadata To enjoy a painting or music on Europeana, no special language skills are required? Wrong! Cultural objects are described using metadata. Metadata comes in different languages (depending on the country of origin of the data provider). Most often, metadata does not carry language information. How can you still find what you are looking for?

Problem: Metadata Example: Subject. Philosophy, Philosophie, Filosofía, Filosofie, Filosofija, Heimspeki, Филозофија, etc.

Metadata: Option 1 Indicate the language of the metadata. This supports the use of translation or mapping tools to find the correct term in other languages or controlled vocabularies. Example: <subject language="English">Philosophy</subject>

Europeana Query Translation

Europeana Query Translation How it works: http://www.europeana.eu/portal/search.html?query=philosophy

Europeana Query Translation http://[language].wikipedia.org/w/api.php?action=query&prop=langlinks&format=json&titles=[query term] {"lang":"de","*":"Philosophie"}, {"lang":"es","*":"Filosofía"}
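
The lookup above can be sketched in a few lines of Python. This is a minimal illustration, not Europeana's actual implementation: the function names are hypothetical, the sample response is hand-written rather than live data, and it assumes the MediaWiki API's default JSON format, in which each language link carries the translated title under the "*" key.

```python
import json
from urllib.parse import urlencode


def build_langlinks_url(source_lang: str, term: str) -> str:
    """Build a MediaWiki API URL asking for the language links of one title."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "format": "json",
        "titles": term,
    }
    return f"https://{source_lang}.wikipedia.org/w/api.php?{urlencode(params)}"


def parse_langlinks(payload: str) -> dict[str, str]:
    """Return {language code: translated title} from a langlinks response."""
    data = json.loads(payload)
    translations = {}
    for page in data.get("query", {}).get("pages", {}).values():
        for link in page.get("langlinks", []):
            translations[link["lang"]] = link["*"]
    return translations


# Hand-written sample in the shape of such a response (assumption, not live data).
SAMPLE_RESPONSE = json.dumps({
    "query": {"pages": {"1": {"title": "Philosophy", "langlinks": [
        {"lang": "de", "*": "Philosophie"},
        {"lang": "es", "*": "Filosofía"},
    ]}}}
})

print(build_langlinks_url("en", "Philosophy"))
print(parse_langlinks(SAMPLE_RESPONSE))
```

Keeping URL construction and response parsing separate means the parsing step can be tried out offline against a saved response.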

Europeana Query Translation http://www.europeana.eu/portal/search.html?query=philosophy&philosophie&filosofía (simplified for illustration purposes; the above query does not really work, as the query expansion is done internally)
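
Since the expansion happens internally, its exact form is not shown here. As a rough sketch, one could assume the original term and its translations are combined into a single OR query; the function name and the OR syntax below are assumptions for illustration only.

```python
def expand_query(term: str, translations: dict[str, str]) -> str:
    """Combine a search term and its translations into one OR query."""
    variants = [term]
    for translated in translations.values():
        # Skip translations that only differ from an existing variant by case.
        if translated.lower() not in (v.lower() for v in variants):
            variants.append(translated)
    # Quote each variant so multi-word titles stay together as phrases.
    return " OR ".join(f'"{v}"' for v in variants)


print(expand_query("philosophy", {"de": "Philosophie", "es": "Filosofía"}))
```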

Europeana Query Translation Read more: Query Translation in Europeana: http://journal.code4lib.org/articles/10285 Improving Europeana Multilingual Search: http://blog.europeana.eu/2014/08/improvingsearch-across-languages/

El idioma analítico de John Wilkins Jorge Luis Borges, Otras Inquisiciones Theoretically, it is not impossible to think of a language where the name of each thing says all the details of its destiny, past and future

Metadata: Option 2 Even better: Use a language-independent identifier for subject classification (e.g. Library of Congress, Wikidata, DDC). Example: <subject id="loc">sh85100849</subject> <subject id="wikidata">Q5891</subject>
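
To illustrate why an identifier helps, here is a small sketch in which labels in several languages resolve to the same Wikidata identifier from the example above. The lookup table and function names are hypothetical; a real system would draw the labels from the vocabulary itself.

```python
# Labels from the "Philosophy" example, all pointing at the Wikidata
# identifier used in the example above. The table is a hypothetical illustration.
LABEL_TO_ID = {
    "philosophy": "Q5891",
    "philosophie": "Q5891",
    "filosofía": "Q5891",
    "filosofie": "Q5891",
    "filosofija": "Q5891",
    "heimspeki": "Q5891",
    "филозофија": "Q5891",
}


def subject_id(label: str):
    """Look up the language-independent identifier for a subject label."""
    return LABEL_TO_ID.get(label.strip().casefold())


def same_subject(label_a: str, label_b: str) -> bool:
    """Two labels match when both resolve to the same known identifier."""
    id_a, id_b = subject_id(label_a), subject_id(label_b)
    return id_a is not None and id_a == id_b


print(same_subject("Philosophie", "Heimspeki"))
print(same_subject("Philosophy", "History"))
```

Once records carry the identifier, matching no longer depends on which language the data provider used for the label.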

Two examples: Europeana 1914-1918 http://www.europeana1914-1918.eu/ and Europeana Newspapers http://www.europeana-newspapers.eu/

Europeana 1914-1918 In fact, three projects: Europeana Collections 1914-1918 (400,000 digitised items from World War I), Europeana 1914-1918 (user-generated content from World War I), and European Film Gateway 1914 (740 hours of film related to World War I). How to present these as a uniform collection?

Europeana 1914-1918 Analysis of subject classifications available at content holding institutions, e.g. catalogues

Europeana 1914-1918 Ranking of most frequent subjects

Subject Heading | Count
World War, 1914-1918--Campaigns | 4307
World War, 1914-1918--Trench warfare | 2990
World War, 1914-1918--Transportation | 2171
World War, 1914-1918--Caricatures and cartoons | 2013
World War, 1914-1918--Serbia | 1755
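
A ranking like this can be produced by counting subject headings across all records. The sketch below assumes each record is a dictionary with a "subjects" list; that record layout and the sample items are hypothetical.

```python
from collections import Counter


def rank_subjects(records, top_n=5):
    """Count subject headings over all records and return the most frequent."""
    counts = Counter(
        heading for record in records for heading in record.get("subjects", [])
    )
    return counts.most_common(top_n)


# Hypothetical sample records; real counts come from the institutions' catalogues.
SAMPLE_RECORDS = [
    {"id": "item-1", "subjects": ["World War, 1914-1918--Campaigns",
                                  "World War, 1914-1918--Trench warfare"]},
    {"id": "item-2", "subjects": ["World War, 1914-1918--Campaigns"]},
    {"id": "item-3", "subjects": ["World War, 1914-1918--Campaigns",
                                  "World War, 1914-1918--Trench warfare",
                                  "World War, 1914-1918--Serbia"]},
]

for heading, count in rank_subjects(SAMPLE_RECORDS, top_n=2):
    print(heading, count)
```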

Europeana 1914-1918 Mapping subjects to LoC identifiers

Subject Heading | LoC identifier
World War, 1914-1918--Campaigns | sh85148240
World War, 1914-1918--Trench warfare | sh2008113804
World War, 1914-1918--Transportation | sh2008113817
World War, 1914-1918--Caricatures and cartoons | sh2010119466
World War, 1914-1918--Serbia | sh2008113856

Europeana 1914-1918 Enrichment of metadata with LCSH identifiers
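
As a sketch of the enrichment step, each record's subject headings can be looked up in a heading-to-identifier table and the matching identifiers stored next to them. The two mappings below are taken from the mapping slide above, paired in the order listed there; the record layout and the "subject_ids" field name are assumptions.

```python
# Heading-to-identifier pairs from the mapping slide (paired in listed order).
HEADING_TO_LCSH = {
    "World War, 1914-1918--Campaigns": "sh85148240",
    "World War, 1914-1918--Trench warfare": "sh2008113804",
}


def enrich(record: dict) -> dict:
    """Return a copy of the record with LCSH identifiers added where known."""
    enriched = dict(record)
    enriched["subject_ids"] = [
        HEADING_TO_LCSH[heading]
        for heading in record.get("subjects", [])
        if heading in HEADING_TO_LCSH  # unmapped headings are left without an id
    ]
    return enriched


# Hypothetical record for illustration.
record = {"id": "item-1", "subjects": ["World War, 1914-1918--Trench warfare",
                                       "Postcards"]}
print(enrich(record))
```

The original headings are kept untouched, so the enrichment only adds information and can be repeated when the mapping table grows.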

Europeana 1914-1918 Translation of all subjects

Europeana Newspapers Full-text collection of 12 million digitised newspaper pages from 23 European libraries. Around 40 different languages overall. Newspapers from 1618-1990: historical spelling variants! www.theeuropeanlibrary.org/tel4/newspapers

Europeana Newspapers Content in Europeana Newspapers

Europeana Newspapers 12 million newspaper pages = approximately 102,000,000,000 words! Impossible to translate everything into multiple languages. But there are alternatives.

Europeana Newspapers What if it were possible to search for persons, locations and events across languages? Example: Siege of Przemyśl

Europeana Newspapers Named Entity Recognition: Stanford University NER toolkit

Europeana Newspapers Named Entity Disambiguation. Example: "Jordan". Comparison of context.
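
One very simple way to compare context is to count the words shared between the text around a mention and a short description of each candidate entity. The sketch below uses that word-overlap heuristic purely for illustration; it is not the method used in the project, and the candidate descriptions are invented.

```python
import re

# Hypothetical candidate entities with invented keyword descriptions.
CANDIDATES = {
    "Jordan (country)": "kingdom middle east amman river country",
    "Michael Jordan (basketball player)": "basketball player chicago bulls nba",
}


def tokens(text: str) -> set:
    """Lower-case the text and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def disambiguate(mention_context: str, candidates: dict) -> str:
    """Pick the candidate whose description shares the most words with the context."""
    context = tokens(mention_context)
    return max(candidates, key=lambda name: len(context & tokens(candidates[name])))


print(disambiguate("Jordan scored 40 points for the Chicago Bulls", CANDIDATES))
print(disambiguate("The river Jordan flows through the Middle East", CANDIDATES))
```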

Europeana Newspapers Named Entity Linking: freebase.com/m/054c1, lccn.loc.gov/n92121379, wikidata.org/wiki/q41421

What if all metadata in Europeana 1914-1918 had language-independent identifiers, and all entities in Europeana Newspapers had language-independent identifiers? Then it should be possible to link the two distinct collections!
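
Assuming both collections carried such identifiers, linking them would amount to a join on the shared identifier. The sketch below shows this with invented items and placeholder identifiers; the record layout ("title", "ids") is hypothetical.

```python
from collections import defaultdict


def link_by_identifier(collection_a, collection_b):
    """Return (title_a, identifier, title_b) for items sharing an identifier."""
    index = defaultdict(list)
    for item in collection_b:
        for ident in item["ids"]:
            index[ident].append(item["title"])
    links = []
    for item in collection_a:
        for ident in item["ids"]:
            for other_title in index.get(ident, []):
                links.append((item["title"], ident, other_title))
    return links


# Invented items with placeholder identifiers, for illustration only.
WAR_ITEMS = [
    {"title": "Photograph of a trench", "ids": ["ID-TRENCH"]},
    {"title": "Letter from the front", "ids": ["ID-SIEGE", "ID-TRENCH"]},
]
NEWSPAPER_PAGES = [
    {"title": "Newspaper page A", "ids": ["ID-SIEGE"]},
    {"title": "Newspaper page B", "ids": ["ID-OTHER"]},
]

print(link_by_identifier(WAR_ITEMS, NEWSPAPER_PAGES))
```

Because the join key is language-independent, it does not matter which languages the metadata or the newspaper text are written in.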

Research Questions This would allow for some very interesting digital humanities research questions, e.g.: How were World War I events covered in newspapers of different nations across Europe? What were the relations between persons, places and events during World War I?

The Republic of Letters http://stanford.edu/group/toolingup/rplviz/rplviz.swf

Global Database of Events, Language and Tone http://www.gdeltproject.org/

Conclusion We need know-how and technologies for multilingual linking of objects across cultural heritage organisations and digital collections. We need guidelines and standards that support the creation and provision of metadata for cultural heritage objects as multilingual linked data.

To follow up: Europeana White Paper on Best Practices for Multilingual Access to Digital Libraries; W3C Community Group Best Practices for Multilingual Linked Open Data; Europeana Connect - Multilinguality

Tractatus Logico-Philosophicus Ludwig Wittgenstein, 1922 Proposition 7: Whereof one cannot speak, thereof one must be silent

Thank you for your attention! Clemens Neudecker, Staatsbibliothek zu Berlin, @cneudecker