Authorship Verification with the Minmax Metric

Mike Kestemont, University of Antwerp, mike.kestemont@uantwerp.be
Justin Stover, University of Oxford, justin.stover@classics.ox.ac.uk
Moshe Koppel, Bar-Ilan University, moishk@gmail.com
Folgert Karsdorp, Radboud University Nijmegen, fbkarsdorp@fastmail.nl
Walter Daelemans, University of Antwerp, walter.daelemans@uantwerp.be

Authorship studies have long played a central role in stylometry, the popular subfield of DH in which the writing style of a text is studied as a function of its author's identity. While authorship studies come in many flavors, a remarkable aspect is that the field continues to be dominated by so-called lazy approaches, where the authorship of an anonymous document is determined by extrapolating the authorship of a document's nearest neighbor. For this, researchers use metrics to calculate the distances between vector representations of documents in a higher-dimensional space, such as the well-known Manhattan (city block) distance. In this paper, we apply the minmax metric to the problem of authorship verification.

We illustrate the broader applicability of authorship verification by reporting a high-profile case study from Classical Antiquity. The War Commentaries by Julius Caesar (100-44 BC) are a group of Latin descriptions of the military campaigns of the famous Roman statesman. While Caesar must have authored a significant portion of these commentaries himself, the exact delineation of his contribution to this important corpus remains a controversial matter. Most notably, Aulus Hirtius, one of Caesar's most trusted generals, is sometimes believed to have contributed significantly to the corpus.

To evaluate our verification approach, we follow the procedure of the 2014 track on authorship verification in the PAN competition on uncovering plagiarism, authorship, and social software misuse. This track focused on the open task of authorship verification in 6 data sets. Each dataset holds a number of problems, where given (a) at least one training text by a particular target author, (b) a set of similar mini-oeuvres by other authors, and (c) a new anonymous text, the task is to determine whether or not the anonymous text was written by the target author. For each verification problem, a system must output a real-valued confidence score between 0.0 and 1.0. For each dataset, a fully independent training and test corpus are available (i.e. neither the problems nor the authors and texts overlap between the two sets). Systems are eventually evaluated using two scoring metrics that were also used at PAN: the established AUC score, as well as the so-called C@1, a variation of the traditional accuracy score that gives more credit to systems that decide to leave some difficult verification problems unanswered.

As is common in text classification, we vectorize the datasets under a bag-of-words assumption, which largely ignores the original word order in a document. We use character tetragrams below (Koppel and Winter 2014) and experiment with a number of different vector space models:
- plain tf, where simple relative frequencies are used;
- tf-std, where the tf-model is scaled using a feature's standard deviation in the corpus (cf. Burrows's Delta: Burrows 2002);
- tf-idf, where the tf-model is scaled using a feature's inverse document frequency (to increase the weight of rare terms).
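The three vector space models listed above can be illustrated with a minimal sketch. It is not the code used in our experiments: the function names (`char_ngrams`, `vectorize`, `tf_std`, `tf_idf`) are hypothetical, and the idf formula `log(N / df) + 1` is only one common variant, chosen here as an assumption because the text does not fix a particular one.

```python
from collections import Counter

import numpy as np


def char_ngrams(text, n=4):
    """Return all overlapping character n-grams (tetragrams by default)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def vectorize(docs, n=4):
    """Build a plain tf matrix (documents x features) of relative frequencies."""
    counts = [Counter(char_ngrams(doc, n)) for doc in docs]
    vocab = sorted(set().union(*counts))
    index = {gram: j for j, gram in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for i, doc_counts in enumerate(counts):
        total = sum(doc_counts.values())
        if total == 0:  # document shorter than n characters
            continue
        for gram, freq in doc_counts.items():
            tf[i, index[gram]] = freq / total
    return vocab, tf


def tf_std(tf):
    """Scale each feature by its standard deviation across the corpus."""
    std = tf.std(axis=0)
    std[std == 0] = 1.0  # leave constant features unscaled
    return tf / std


def tf_idf(tf):
    """Scale each feature by an inverse document frequency (assumed variant)."""
    n_docs = tf.shape[0]
    df = (tf > 0).sum(axis=0)  # every vocabulary item occurs at least once
    return tf * (np.log(n_docs / df) + 1.0)


vocab, tf = vectorize(["abcde", "abcdx"])
```

In this toy example the shared tetragram "abcd" keeps its weight under tf-idf, while the two tetragrams that occur in only one document are weighted up.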
In our experiments, we include the minmax distance metric, a still fairly novel algorithm in stylometry (Koppel and Winter 2014), which calculates a real-valued distance score between two document vectors A and B:

Figure 1: The minmax metric

In our experiments, we make use of the General Imposters Method, a bootstrapped approach to authorship verification. We use Algorithm 1 (Figure 2) to determine whether an anonymous text was written by the target author specified in the problem.
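As a complement to Figure 1, the minmax metric can be sketched in a few lines. The sketch assumes the definition commonly associated with Koppel and Winter (2014), turned into a distance: one minus the ratio of the summed element-wise minima to the summed element-wise maxima of A and B. The function name `minmax_distance` and the handling of two all-zero vectors are our own illustrative choices.

```python
import numpy as np


def minmax_distance(a, b):
    """Minmax distance between two non-negative document vectors.

    Assumed definition: 1 - sum(min(a_i, b_i)) / sum(max(a_i, b_i)).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("vectors must have the same shape")
    if np.any(a < 0) or np.any(b < 0):
        raise ValueError("minmax is defined for non-negative vectors")
    denominator = np.maximum(a, b).sum()
    if denominator == 0.0:
        return 0.0  # two all-zero vectors: treated as identical (a convention)
    return 1.0 - np.minimum(a, b).sum() / denominator


# Identical vectors are at distance 0, vectors without shared features at 1.
same = minmax_distance([1, 2, 3], [1, 2, 3])
disjoint = minmax_distance([1, 0], [0, 1])
partial = minmax_distance([2, 0, 1], [1, 1, 1])  # 1 - 2/4
```

Because both sums run over the same features, the score always lies between 0.0 and 1.0 for non-negative frequency vectors.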

Figure 2: The General Imposters Algorithm

During k iterations (default 100), we randomly select a sample (default 50%) of all the available features in the data set. Likewise, we randomly select m imposter documents (default 30), which were not written by the target author. Next, we use a dist() function to assess whether the anonymous text is closer to any text by the target author than to any text written by the imposters. Here, dist() represents a regular distance metric, such as the Manhattan, Cosine or Ruzicka distance metric. The general intuition is that we do not just calculate how different two documents are; rather, we test whether the stylistic differences between them are consistent (a) across many different feature sets, and (b) in comparison to other randomly sampled documents.

We compare the imposter approach to a strong baseline proposed by Potha and Stamatatos (2014) on a reference corpus of Latin prose from Antiquity. We will demonstrate that the imposter approach produces extremely strong results across most combinations of vector spaces and distance metrics (cf. the precision-recall curves below).
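Returning to the procedure of Figure 2, the bootstrapped loop can be sketched as follows. This is an illustration under stated assumptions rather than the exact algorithm of the figure: the function name `imposters_score` and its parameters are hypothetical, the returned score is assumed to be the proportion of iterations in which a target text is the nearest neighbor, and the inlined `minmax_distance` assumes the one-minus-ratio form of the minmax metric.

```python
import numpy as np


def minmax_distance(a, b):
    """Assumed minmax distance: 1 - sum(min) / sum(max)."""
    denominator = np.maximum(a, b).sum()
    if denominator == 0.0:
        return 0.0
    return 1.0 - np.minimum(a, b).sum() / denominator


def imposters_score(anonymous, target_docs, imposter_docs,
                    dist=minmax_distance, k=100, feature_rate=0.5,
                    m=30, seed=0):
    """Proportion of k iterations in which the anonymous text lies closer
    to some target-author text than to every sampled imposter text."""
    rng = np.random.default_rng(seed)
    anonymous = np.asarray(anonymous, dtype=float)
    target_docs = np.asarray(target_docs, dtype=float)
    imposter_docs = np.asarray(imposter_docs, dtype=float)
    if len(target_docs) == 0 or len(imposter_docs) == 0:
        raise ValueError("need at least one target and one imposter text")

    n_features = anonymous.shape[0]
    n_sampled = max(1, int(round(feature_rate * n_features)))
    m = min(m, len(imposter_docs))  # cannot sample more than are available

    hits = 0
    for _ in range(k):
        # Random feature subset and random imposters for this iteration.
        feats = rng.choice(n_features, size=n_sampled, replace=False)
        imps = rng.choice(len(imposter_docs), size=m, replace=False)
        d_target = min(dist(anonymous[feats], t[feats]) for t in target_docs)
        d_imposter = min(dist(anonymous[feats], imposter_docs[i][feats])
                         for i in imps)
        if d_target < d_imposter:
            hits += 1
    return hits / k


imposters = [[4, 3, 2, 1], [2, 1, 4, 3]]
same_author = imposters_score([1, 2, 3, 4], [[1, 2, 3, 4]], imposters)
other_author = imposters_score([4, 3, 2, 1], [[1, 2, 3, 4]], imposters)
```

The resulting value lies between 0.0 and 1.0 and can serve as the confidence score required per verification problem; any other distance function with the same two-vector signature can be passed as `dist`.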

Figure 3: Precision-recall curves for the Latin benchmark corpus, using the verification system proposed by Potha and Stamatatos (2014).

Figure 4: Precision-recall curves for the Latin benchmark corpus, using the imposter approach as a verification system (2014).

Finally, we report the case study concerning the Corpus Caesarianum, the group of five commentaries on Caesar's military campaigns: Bellum Gallicum, Bellum civile, Bellum Alexandrinum, Bellum Africum, and Bellum Hispaniense. The first two commentaries are mainly by Caesar himself, the only exception being the final part of the Gallic War (Book 8), which is by Caesar's general Aulus Hirtius. Suetonius, writing a century and a half later, suggests that either Hirtius or another general, named Oppius, authored the remaining works. We will report experiments that broadly support Hirtius's own claim that he himself compiled and edited the corpus of the non-Caesarian commentaries. Figure 5, for instance, shows a heatmap-like visualisation, in which Hirtius's Book 8 of the Gallic War clearly clusters with the bulk of the Alexandrian War (labeled x).

Figure 5: Minmax-based clustermap of 1000-word samples of the Corpus Caesarianum.

References

Argamon, S. (2008) Interpreting Burrows's Delta: Geometric and probabilistic foundations, Literary and Linguistic Computing, vol. 23, pp. 131-147.
Burrows, J. (2002) 'Delta': A measure of stylistic difference and a guide to likely authorship, Literary and Linguistic Computing, vol. 17, pp. 267-287.
Gaertner, J. and Hausburg, B. (2013) Caesar and the Bellum Alexandrinum: An Analysis of Style, Narrative Technique, and the Reception of Greek Historiography. Göttingen: Vandenhoeck & Ruprecht.
Koppel, M. and Winter, Y. (2014) Determining if two documents are written by the same author, Journal of the Association for Information Science and Technology, vol. 65, pp. 178-187.

Mayer, M. (2011) Caesar and the corpus caesarianum. In: Marasco, G. (ed.), Political autobiographies and memoirs in antiquity: A Brill companion, pp. 189-232. Leyden: Brill.
Potha, N. and Stamatatos, E. (2014) A profile-based method for authorship verification. In: Likas, A. et al. (eds.), Artificial Intelligence: Methods and Applications, volume 8445 of Lecture Notes in Computer Science, pp. 313-326. Berlin etc.: Springer International Publishing.
Stamatatos, E. et al. (2014) Overview of the author identification task at PAN 2014. In: Working Notes for CLEF 2014 Conference, Sheffield, UK, September 15-18, 2014, pp. 877-897.
Stover, J., Winter, Y., Koppel, M. and Kestemont, M. (2015) Computational authorship verification method attributes a new work to a major 2nd century African author, Journal of the American Society for Information Science and Technology, vol. 66, pp. 239-242.