GROBID for Humanities: When engineering meets History


To cite this version: Charles Riondet, Luca Foppiano. GROBID for Humanities: When engineering meets History. Text as a Resource. Text Mining in Historical Science, Jun 2017, Paris, France. <hal-01585693>

HAL Id: hal-01585693
https://hal.inria.fr/hal-01585693
Submitted on 11 Sep 2017

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Distributed under a Creative Commons Attribution 4.0 International License

GROBID for Humanities: When engineering meets History
Charles Riondet, Luca Foppiano
ALMAnaCH, Inria Paris

Acknowledgement
- Anne Baillot (Centre Marc Bloch, Berlin): German literature and History, Mastermind
- Hector Martinez Alonso (ALMAnaCH, Inria Paris): POS tagging, NLP stuff


Context
ALMAnaCH = Automatic Language Modelling and Analysis & Computational Humanities, a joint EPHE-Inria team.
A strange mixture of people with different skills working side by side:
- Computer engineering: data mining, information extraction
- NLP experts: semantic and syntactic analysis, parsing, etc.
- Digital Humanities (History and Literature background): data modelling, textual analysis, digital philology
Why not combine all the skills available around us?

Who are we?
- Charles Riondet, historian: WW2, data modelling
- Luca Foppiano, engineer: software engineering, data mining, machine learning, knowledge discovery


Original hermeneutical process
In this situation, we started with some empirical attempts:
- Person 1: "Hey, I have data"
- Person 2: "Hey, I have tools"
- Person 1 and 2 (ecstatic): "Let's try to use tools on some random data (YOLO)"

Original hermeneutical process
But with a research question, it worked a little better:
- Person 1: "Hey, I have data."
- Person 2: "Hey, I have tools."
- Person 1: "Wow, I also have an idea of what I want to do"
- Person 2: "Maybe my tools could help you shed some light"

Original hermeneutical process
As a next step, we started to think about a win-win strategy:
- Person 1: "Hey, I have data and a research question"
- Person 2: "Hey, I have tools"
- Person 1: "Your tools might need new data and use cases to improve (ensure genericity and cross-domain applicability)"

Original hermeneutical process
Finally:
- Person 1: "Hey, I have data."
- Person 2: "Hey, I have tools."
- Person 3: "Hey guys, I also have data, tools and some nice research questions"
- Person 4: "I want to be part of the group, please"
- Chorus: "Let's create a research project all together"

ECRPER project (ANR/DFG, 2018-2021)
Franco-German personal writings in wartime (19th-20th century)
Partners: ALMAnaCH (Inria Paris), DHI (Paris), Centre Marc Bloch (Berlin), IEP Lille
Analysing German and French diaries and letters written during the French-German wars since the 19th century:
- Napoleonic Wars
- War of 1870
- First World War
- Second World War
Funding has not been granted yet, but we are in a hurry, so we have already started work.

ECRPER in short
1) Diachronic and synchronic analysis:
- how the Self and the Other are represented
- how the relationship to events is elaborated through personal writings
2) Bring together different tools, set up a solid editorial and hermeneutical workflow, and make it available for further research.
Workflow stages: OCR, annotation, edition, publication

ECRPER in short: focus of this talk
Within point 1, the emphasis is on how the Other is represented; within point 2, on the OCR, annotation and edition stages.

Representation of the other
- The representation of the other in the context of war discourses: how Germans see the French and vice versa.
- Broadened to all mentions of the actors of the conflicts (nations, organisations, persons, person types, etc.)
- All instances appearing in the discourse will be modelled and related to one another.
- Diachronic analysis: appearance and disappearance of a mention, semantic evolution, spelling variations, ...
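The "appearance and disappearance of a mention" can be illustrated with a minimal sketch. It assumes each mention has already been extracted with the year of its diary entry. The years below are invented for illustration and are not results from the corpus.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical (year, mention) pairs; a real run would read them from dated diary entries.
OBSERVATIONS = [
    (1943, "Gaulisme"),
    (1944, "Gaulisme"),
    (1944, "Gaullisme"),
    (1944, "souris grises"),
    (1945, "Gaullisme"),
]


def attestation_span(observations: Iterable[Tuple[int, str]]) -> Dict[str, Tuple[int, int]]:
    """Return, for each mention, the first and last year in which it is attested."""
    years = defaultdict(list)
    for year, mention in observations:
        years[mention].append(year)
    return {mention: (min(ys), max(ys)) for mention, ys in years.items()}


spans = attestation_span(OBSERVATIONS)
for mention, (first, last) in sorted(spans.items()):
    print(f"{mention}: {first}-{last}")
```

Spelling variants are deliberately kept as separate keys here, so that their respective life spans can be compared before any normalisation.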

Representation of the other: NLP approach
- The other / the actors of the conflicts: a set of named entities
- Recognise structural discourse elements in French and in German: part-of-speech tagging
- Opinion mining: lexical analysis of the context of the mentions, likely to vary according to time, nation and social origin of the writer

The big picture
[Diagram: Text -> Named Entity Recognition and Disambiguation + Part-of-speech tagging -> Output. NER classes: LOCATION, PERSON, ... Domains: History, Military, ... Resources: Wikipedia, domain-specific dictionary.]

(N)ERD, a tool for Named Entity Recognition and Disambiguation
GROBID (N)ERD is a tool to recognise and extract named entities from text or PDF documents. The entities are then resolved (disambiguated) against Wikipedia.
E.g. "The president Washington went to Washington to celebrate his birthday." (the first mention is a person, the second a location)

Overview of the GROBID Family
GROBID (or Grobid) means GeneRation Of BIbliographic Data. Written by Patrice Lopez and released as open source (Apache 2.0 licence).
Family members: GROBID, GROBID-NER, (N)ERD, GROBID-ASTRO, GROBID-QUANTITIES, GROBID-DICTIONARIES, ...
Available on GitHub:
- http://github.com/kermitt2/grobid
- http://github.com/kermitt2/grobid-ner
- http://github.com/kermitt2/nerd
- [...]

Part of speech tagging (in one slide)
Syntactic analysis of sentences: produces part-of-speech tags (noun, verb, adj., etc.) and a dependency tree.
[1] https://www.slideshare.net/vseloved/crash-course-in-natural-language-processing-2016
[2] http://naviglinlp.blogspot.fr/2017/04/lecture-7-part-of-speech-tagging.html

The corpus: WW2 diaries written in French
- Journal de Léo Hamon (Archives nationales, 72AJ42). French lawyer of Russian origin, one of the leaders of the Parisian Resistance. Underground daily life, reports on meetings, comments on the course of the war, on the organization of the Resistance and on the preparation of the seizure of power in Paris.
- Journal d'Henri Chabasse (Musée de la Résistance nationale, 13/3907b). Nationalist middle-class Parisian, involved neither in the Resistance nor in the Collaboration. Daily life, but mostly comments on the course of the war and on the French political situation, from D-Day to fall 1944 (plus an entry related to the Hiroshima bombing).

Need a specific dictionary
- Context-based expressions. Ex: "Souris grises" = German army female auxiliaries (not the animal)
- Diary-internal terminology. Ex: "les cocs" = members of the communist party (not a misspelling of the animal)
- Unnormalized spelling. Ex: Gaulisme → Gaullisme
- Metonymies. Ex: Vichy → Régime de Vichy (not the town of Vichy)
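The four kinds of entries above could be held in a small lookup structure. The sketch below is only an illustration: the class name, field names and the accent-insensitive matching are assumptions, not the format of the dictionary actually built for the project.

```python
import unicodedata
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DictEntry:
    preferred: str  # normalised form of the term
    gloss: str      # meaning in the WW2 context
    kind: str       # why the entry is needed


def normalise(term: str) -> str:
    """Lowercase and strip accents so that lookups tolerate spelling noise."""
    decomposed = unicodedata.normalize("NFD", term.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


# Hypothetical entries built from the four examples on the slide.
ENTRIES = {
    "Souris grises": DictEntry("Souris grises", "German army female auxiliaries", "context-based expression"),
    "les cocs": DictEntry("les cocs", "Members of the communist party", "diary-internal terminology"),
    "Gaulisme": DictEntry("Gaullisme", "Movement inspired by De Gaulle", "unnormalised spelling"),
    "Vichy": DictEntry("Régime de Vichy", "État français (not the town)", "metonymy"),
}

# Index the entries by their normalised surface form.
LOOKUP = {normalise(surface): entry for surface, entry in ENTRIES.items()}


def lookup(mention: str) -> Optional[DictEntry]:
    """Return the domain entry for a mention, or None if the dictionary is silent."""
    return LOOKUP.get(normalise(mention))


print(lookup("gaulisme"))
print(lookup("Paris"))
```

Keeping the surface form found in the diaries as the key and the normalised form as a field preserves the original spelling for philological purposes while still allowing resolution.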

Need a specific dictionary
Because Wikipedia doesn't know everything.
Sources:
- Marcot et al., Dictionnaire historique de la Résistance, Paris, R. Laffont, 2006.
- Rue de la Mémoire, Volksbund Deutsche Kriegsgräberfürsorge e.V., 2016, Parler de l'histoire et de la Mémoire. Première et Deuxième guerre mondiale. Glossaire Franco-allemand.
- Specific terms found in the diaries

Need a specific dictionary
Modelled in TEI-TBX (thanks to Stefan Pernes):
- Machine-readable
- Standard

4 main steps
1) Extract Named Entities
2) Apply domain-specific dictionary
3) POS tagging
4) Combine everything (POS and NER)

Workflow (1) - Extract Named Entities
"Nous parlons du procès Pucheu. La question est plus actuelle. (...) J'indique qu'à mon avis tout ce procès a été mal conduit - il fallait (...) proclamer que devant des crimes inouïs, (...) la nation prenait une décision politique, immoler les hommes de la haute trahison : "Jetons à l'Europe, en défi, une tête de roi", jetons aux combinards de Vichy et de Washington, en défi, une tête de traître. (...) Je vois Yves qui m'avait cité l'impatience d'Alger devant le cas Pucheu comme une illustration de la crise du Gaulisme."
Journal de Léo Hamon, entry of March 14th 1944
Entity labels shown on the slide: PERSON, LOCATION, ORGANISATION

Workflow (2) - Apply domain specific dictionary
"Nous parlons du procès Pucheu. La question est plus actuelle. (...) aux combinards de Vichy et de Washington, en défi, une tête de traître. (...) Je vois Yves qui m'avait cité l'impatience d'Alger devant le cas Pucheu comme une illustration de la crise du Gaulisme."
Journal de Léo Hamon, March 14th 1944
Dictionary annotations:
- Vichy → Régime de Vichy, État Français
- Washington → United States of America
- Alger → Comité français de Libération nationale
- Gaulisme → Movement inspired by De Gaulle

Resolving ambiguities - Vichy

Resolving ambiguities
GROBID (N)ERD supports context customisation. The customisation is a way to specialise the entity resolution for a particular domain or profile, for example selecting the correct Wikipedia entry for "Vichy" in the WW2 context. Specific dictionaries can then be integrated seamlessly into the tool.
NOTE: When Wikipedia doesn't provide a page for the alternative meaning, this approach cannot be applied.
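The effect of a domain customisation can be pictured with a toy ranking scheme. This is not how (N)ERD ranks candidates: the keyword sets, the overlap score and the function names below are all assumptions made for illustration. The only point is that adding domain context can tip the choice between two candidate senses.

```python
import re
from typing import Dict, FrozenSet, Set

# Hypothetical candidate senses for the mention "Vichy", each described by a few keywords.
CANDIDATES: Dict[str, Set[str]] = {
    "Vichy (town)": {"ville", "allier", "thermale", "eau"},
    "Régime de Vichy": {"pétain", "collaboration", "gouvernement", "trahison", "procès"},
}

# Hypothetical WW2 customisation: extra context words added to every query.
WW2_CONTEXT: FrozenSet[str] = frozenset(
    {"résistance", "occupation", "collaboration", "pétain", "libération"}
)


def tokens(text: str) -> Set[str]:
    """Lowercased word tokens of a sentence."""
    return set(re.findall(r"\w+", text.lower()))


def resolve(sentence: str, candidates: Dict[str, Set[str]],
            customisation: FrozenSet[str] = frozenset()) -> str:
    """Pick the candidate whose keywords overlap most with the sentence plus the customisation."""
    context = tokens(sentence) | set(customisation)
    scores = {sense: len(keywords & context) for sense, keywords in candidates.items()}
    # On a tie, max() keeps the first candidate in insertion order.
    return max(scores, key=scores.get)


sentence = "jetons aux combinards de Vichy et de Washington, en défi, une tête de traître"
print(resolve(sentence, CANDIDATES))
print(resolve(sentence, CANDIDATES, WW2_CONTEXT))
```

Without the customisation, the sentence alone gives no evidence either way and the first candidate wins by default. With it, the WW2 sense scores higher.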

Workflow (1) - Extract Named Entities
"Place de la République, Hotel Moderne, vaste bâtisse où étaient logées les petites souris grises, d'autres disent «les Salamandres», jeunes allemandes en uniforme. Elles partent et elles ne pouvaient emporter qu'un léger bagage à la main. (...)"
Journal d'Henri Chabasse, August 13th 1944
Entity labels shown on the slide, in order of appearance:
- Place de la République: LOCATION
- Hotel Moderne: LOCATION
- souris grises: ANIMAL
- Salamandres: ANIMAL
- allemandes: NATIONAL

Adjusting entity resolution
No Wikipedia page, no party. The dictionary is then used to override the resolution phase and force the correct definition.
This approach is very specific and less frequent, and should only be a fallback solution, e.g. when nicknames are present in the text.
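The override can be sketched as a lookup that takes precedence over the generic resolution. The data and function below are hypothetical stand-ins: the "generic" results simply mirror the labels shown for the Chabasse extract, and the real tool's interface is not shown in the slides.

```python
from typing import Dict, Optional, Tuple

# Hypothetical output of the generic resolution for the Chabasse extract.
GENERIC_RESOLUTION: Dict[str, str] = {
    "souris grises": "ANIMAL",
    "salamandres": "ANIMAL",
    "place de la république": "LOCATION",
}

# Hypothetical domain dictionary used only as a fallback for nicknames.
DOMAIN_OVERRIDES: Dict[str, str] = {
    "souris grises": "German army female auxiliaries",
    "salamandres": "German army female auxiliaries",
}


def resolve(mention: str,
            generic: Dict[str, str],
            overrides: Dict[str, str]) -> Tuple[Optional[str], str]:
    """Return (resolution, source); the dictionary wins whenever it has an entry."""
    key = mention.lower()
    forced = overrides.get(key)
    if forced is not None:
        return forced, "dictionary"
    return generic.get(key), "generic"


for mention in ("souris grises", "Salamandres", "Place de la République"):
    print(mention, "->", resolve(mention, GENERIC_RESOLUTION, DOMAIN_OVERRIDES))
```

Returning the source alongside the resolution makes it easy to audit how often the fallback is used, which matters if it is meant to stay exceptional.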

Workflow (2) - Combining dictionaries
"Place de la République, Hotel Moderne, vaste bâtisse où étaient logées les petites souris grises, d'autres disent «les Salamandres», jeunes allemandes en uniforme. Elles partent et elles ne pouvaient emporter qu'un léger bagage à la main. (...)"
Journal d'Henri Chabasse, August 13th 1944
Dictionary annotation: souris grises, Salamandres → German army female auxiliaries

Workflow (3) - POS tagging and parsing
"Nous devons jeter aux combinards de Vichy et de Washington, en défi, une tête de traître."
The dependency tree can be used to find expressions and modifiers relative to the entities of interest.
E.g. named entities "Vichy" (i.e. Régime de Vichy) and "Washington" (i.e. the US government) → nominal modifier "combinards"
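How a dependency tree links "combinards" to both entities can be sketched on a hand-written parse. The analysis below is an assumption: it was written by hand with Universal Dependencies-style relation names (nmod, conj, ...) and simplified for the contraction "aux". It is not the output of the parser used in the project.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    id: int       # 1-based position in the sentence
    form: str     # surface form
    head: int     # id of the governing token (0 = root)
    deprel: str   # dependency relation to the head


# Hand-written, simplified parse of the example sentence (assumption, not parser output).
PARSE: List[Token] = [
    Token(1, "Nous", 2, "nsubj"),
    Token(2, "devons", 0, "root"),
    Token(3, "jeter", 2, "xcomp"),
    Token(4, "aux", 5, "case"),
    Token(5, "combinards", 3, "obl"),
    Token(6, "de", 7, "case"),
    Token(7, "Vichy", 5, "nmod"),
    Token(8, "et", 10, "cc"),
    Token(9, "de", 10, "case"),
    Token(10, "Washington", 7, "conj"),
    Token(11, "en", 12, "case"),
    Token(12, "défi", 3, "obl"),
    Token(13, "une", 14, "det"),
    Token(14, "tête", 3, "obj"),
    Token(15, "de", 16, "case"),
    Token(16, "traître", 14, "nmod"),
]


def governing_noun(tokens: List[Token], entity_form: str) -> Optional[str]:
    """Return the word an entity is attached to through an nmod relation, if any."""
    by_id = {t.id: t for t in tokens}
    tok = next((t for t in tokens if t.form == entity_form), None)
    if tok is None:
        return None
    # A coordinated entity ("... et de Washington") shares the attachment of the first conjunct.
    while tok.deprel == "conj":
        tok = by_id[tok.head]
    if tok.deprel == "nmod":
        return by_id[tok.head].form
    return None


for entity in ("Vichy", "Washington"):
    print(entity, "->", governing_noun(PARSE, entity))
```

Following the conj link is what lets "Washington" inherit the expression attached to "Vichy"; a flat window of neighbouring words would be less reliable on longer coordinations.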

Workflow (4) - combining everything
"Nous parlons du procès Pucheu. La question est plus actuelle. (...) J'indique qu'à mon avis tout ce procès a été mal conduit - il fallait (...) proclamer que devant des crimes inouïs, (...) la nation prenait une décision politique, immoler les hommes de la haute trahison : "Jetons à l'Europe, en défi, une tête de roi", jetons aux combinards de Vichy et de Washington, en défi, une tête de traître. (...) Je vois Yves qui m'avait cité l'impatience d'Alger devant le cas Pucheu comme une illustration de la crise du Gaulisme."
Journal de Léo Hamon, entry of March 14th 1944

Workflow (5) - expected results
List of named entity modifiers and expressions:
- German Army female auxiliaries: Souris grises, Salamandres, Jeunes allemandes en uniforme
- Vichy: combinard, traître, pantin
Onomasiological terminology
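An onomasiological list goes from the concept to the expressions used for it. A minimal sketch of that grouping step follows. It assumes the earlier steps deliver (concept, expression) pairs; the pairs below are taken from the two diary extracts and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical (concept, expression) pairs as the earlier steps might deliver them.
OBSERVATIONS: List[Tuple[str, str]] = [
    ("German army female auxiliaries", "souris grises"),
    ("German army female auxiliaries", "Salamandres"),
    ("German army female auxiliaries", "jeunes allemandes en uniforme"),
    ("Régime de Vichy", "combinard"),
    ("Régime de Vichy", "traître"),
    ("German army female auxiliaries", "souris grises"),  # repeated mention
]


def group_by_concept(observations: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group expressions under their concept, keeping first-seen order and dropping repeats."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for concept, expression in observations:
        if expression not in grouped[concept]:
            grouped[concept].append(expression)
    return dict(grouped)


terminology = group_by_concept(OBSERVATIONS)
for concept, expressions in terminology.items():
    print(f"{concept}: {', '.join(expressions)}")
```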

Next steps
- Apply this workflow to a very large and diverse corpus
- Multilingual terminology with the specific terms for each war
- Add a sentiment polarity tagging layer
- Nice visualisation of the results

Thank you