Narrative Theme Navigation for Sitcoms Supported by Fan-generated Scripts


Gerald Friedland, Luke Gottlieb, Adam Janin (International Computer Science Institute, ICSI)
Presented by: Katya Gonina

What? A novel method for generating indexing information for navigating TV content.

Why?
- Lots of different ways to watch videos: DVD, Blu-ray, on-demand, Internet
- Lots of videos out there! Need better ways to navigate content:
  - Show a particular scene
  - Show where a favorite actor talks
  - Support random seek into videos

Example: Sitcoms
Specifically Seinfeld: a strict set of rules
- Every scene transition is marked by music
- Every punchline is marked by artificial laughter
Video: http://www.youtube.com/watch?v=papxssk6zqa

Outline
1. Original Joke-O-Mat (2009): system setup, evaluation, limitations
2. Enhanced version (2010): system setup, evaluation
3. Future work

Joke-O-Mat
Original system (2008-2009). Ability to navigate basic narrative elements:
- Scenes
- Punchlines
- Dialog segments
- Per-actor filter
- Ability to skip certain parts, surf the episode
"Using Artistic Markers and Speaker Identification for Narrative-Theme Navigation of Seinfeld Episodes," G. Friedland, L. Gottlieb, and A. Janin, Proceedings of the 11th IEEE International Symposium on Multimedia (ISM 2009), San Diego, California, pp. 511-516.

Joke-O-Mat
Two main elements:
1. Pre-processing and analysis step
2. Online video browser

Acoustic Event & Speaker Identification
Goal: train GMMs for different audio events:
- Jerry, Kramer, Elaine, George
- Male & female supporting actors
- Laughter
- Music
- Non-speech (i.e., other noises)
For each class: use a 1-minute audio sample, compute 19-dimensional MFCCs, and train a 20-component GMM, as in the sketch below.
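The per-class training could look roughly like the following Python sketch. The slides do not name the tools used, so librosa, scikit-learn, the diagonal covariance, and the sample file names are all assumptions:

    import librosa
    from sklearn.mixture import GaussianMixture

    def train_class_gmm(wav_path, sr=16000):
        """Train one 20-component GMM on 19-dim MFCCs from ~1 min of audio."""
        audio, _ = librosa.load(wav_path, sr=sr)
        # 19 MFCCs per 10 ms frame (hop of 160 samples at 16 kHz)
        mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=19, hop_length=160)
        gmm = GaussianMixture(n_components=20, covariance_type="diag")
        gmm.fit(mfcc.T)  # shape (frames, 19)
        return gmm

    # One model per acoustic class.
    classes = ["jerry", "kramer", "elaine", "george", "male_support",
               "female_support", "laughter", "music", "nonspeech"]
    models = {c: train_class_gmm(f"{c}_sample.wav") for c in classes}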

Audio Segmentation
Given the trained GMMs:
- Work on 2.5 s windows: 2.5 s / 10 ms per frame = 250 frames
- Compute the likelihood of each frame's features under each GMM
- Use a majority vote over the window to classify it as one of the speakers or as laughter/music/non-speech, as sketched below
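A minimal sketch of the windowed majority vote, reusing the models dictionary from the training sketch above (the exact window handling and tie-breaking are assumptions):

    import numpy as np

    def classify_windows(mfcc_frames, models, win=250):
        """mfcc_frames: (n_frames, 19); returns one label per 2.5 s window."""
        names = list(models)
        # Per-frame log-likelihood under each GMM: (n_frames, n_classes)
        scores = np.stack([models[n].score_samples(mfcc_frames) for n in names],
                          axis=1)
        winners = scores.argmax(axis=1)          # best class per frame
        labels = []
        for start in range(0, len(winners) - win + 1, win):
            votes = np.bincount(winners[start:start + win], minlength=len(names))
            labels.append(names[votes.argmax()])  # majority vote per window
        return labels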

Narrative Theme Analysis
Transforms the acoustic event segmentation and speaker detection into narrative theme segments. Rule-based system:
- Dialog = single contiguous speech segment
- Punchline = dialog + laughter
- Top-5 punchlines = the 5 punchlines followed by the longest laughter
- Scene = segment of at least 10 s between two music events
These rules are sketched in code after this list.
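A hedged sketch of the rules, assuming the classifier output is a list of (label, start, end) tuples in seconds (the representation is an assumption; the thresholds are from the slide):

    def find_punchlines(segments):
        """Punchline = dialog segment immediately followed by laughter."""
        punchlines = []
        for (lab, s, e), (lab2, s2, e2) in zip(segments, segments[1:]):
            if lab == "dialog" and lab2 == "laughter":
                punchlines.append({"start": s, "end": e, "laugh_len": e2 - s2})
        return punchlines

    def top5_punchlines(punchlines):
        """Top-5 = the punchlines followed by the longest laughter."""
        return sorted(punchlines, key=lambda p: p["laugh_len"], reverse=True)[:5]

    def find_scenes(segments, min_len=10.0):
        """Scene = stretch of at least 10 s between two music events."""
        music = [(s, e) for lab, s, e in segments if lab == "music"]
        return [(prev_end, next_start)
                for (_, prev_end), (next_start, _) in zip(music, music[1:])
                if next_start - prev_end >= min_len]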

Narrative Theme Analysis
Creates icons for the GUI. Sitcom rule: an actor has to be shown once a certain speaking time is exceeded.
- Use the median frame of the longest speech segment for each actor (a visual approach could be used here instead)
- Use the median frame for other events (scene, punchlines, dialog)
A sketch of the icon rule follows.
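The icon rule could be sketched as below; OpenCV frame grabbing and the fixed frame rate are assumptions, not something the slides specify:

    import cv2

    def actor_icon(video_path, speech_segments, fps=25.0):
        """Median frame of the actor's longest speech segment.
        speech_segments: list of (start_s, end_s) for one actor."""
        start, end = max(speech_segments, key=lambda se: se[1] - se[0])
        cap = cv2.VideoCapture(video_path)
        cap.set(cv2.CAP_PROP_POS_FRAMES, int((start + end) / 2.0 * fps))
        ok, frame = cap.read()
        cap.release()
        return frame if ok else None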

Online Video Browser
- Shows video; allows play/pause and seeking to random positions
- Navigation panel allows browsing directly to: scene, punchline, top-5 punchlines, dialog element
- Select/deselect actors
Demo: http://www.icsi.berkeley.edu/jokeomat/hd/auto/index.html

Evaluation
Performance for a 25-minute episode:
- Training: 30% real-time (2.7 min)
- Classification: 10% real-time (2.5 min)
- Narrative Theme Analysis: 10% real-time (2.5 min)
- Total: 7.7 min
Diarization Error Rate (DER) = 46% (about 5% per class)
Winner of the ACM Multimedia Grand Challenge 2009

Limitations of the original Joke-O-Mat
- Requires manual training of speaker models
- Requires 60 seconds of training data for each speaker
- Cannot support actors with minor roles
- Does not take into account what was said

Outline
1. Original Joke-O-Mat (2009): system setup, evaluation, limitations
2. Enhanced version (2010): system setup, evaluation
3. Future work

Extended System
Enhanced Joke-O-Mat (2010) adds speech recognition:
- Keyword search
- Automatic alignment of speaker ID and ASR with fan-generated scripts and closed captions
Significantly reduces manual intervention.

New Joke-O-Mat System

Context-Augmentation
Producing transcripts can be costly. Luckily, we have the Internet: fans produce scripts and closed captions.

Fan-generated data
Fan-sourced scripts:
- Tend to be very accurate
- However, don't contain any time information
Closed captions:
- Contain time information
- However, do not contain speaker attribution
- Less accurate, often intentionally altered
Solution: normalize and merge them together.

Fan-generated data
- Normalize the scripts and the closed captions
- Then use minimum edit distance to align the two sources: start & end words in the script = start & end words in the captions
- Use timing from the closed captions, speaker attribution from the script
- If one speaker: single-speaker segment
- If multiple speakers: multi-speaker segment (37.3%)
A sketch of this alignment follows.
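A sketch of the merge, using Python's difflib matcher as a stand-in for the minimum-edit-distance alignment named on the slide (the word-list representation is an assumption):

    from difflib import SequenceMatcher

    def align_words(script_words, caption_words):
        """Return (script_idx, caption_idx) pairs for words matched in order."""
        sm = SequenceMatcher(a=script_words, b=caption_words, autojunk=False)
        pairs = []
        for block in sm.get_matching_blocks():
            for k in range(block.size):
                pairs.append((block.a + k, block.b + k))
        return pairs

    # For each script line, the caption indices of its matched start and end
    # words supply the timing; the script itself supplies the speaker.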

Forced Alignment
Audio + transcript = alignment: generate detailed timing information for each word.
- Perform all steps of a speech recognizer on the audio
- But instead of using a language model, use only the transcript's sequence of words
- Also does speaker adaptation over segments
- Will be more accurate on speaker-homogeneous segments
A toy illustration of alignment as constrained decoding follows.
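The toy sketch below illustrates the idea only: given per-frame log-likelihoods for each transcript word (from an acoustic model, not shown; real systems align at the phone level), find the monotone frame-to-word assignment that maximizes total likelihood:

    import numpy as np

    def force_align(loglik):
        """loglik: (n_frames, n_words) log-likelihoods; words occur in order.
        Returns the word index assigned to each frame."""
        T, W = loglik.shape
        dp = np.full((T, W), -np.inf)
        back = np.zeros((T, W), dtype=int)
        dp[0, 0] = loglik[0, 0]
        for t in range(1, T):
            for w in range(min(t + 1, W)):
                stay = dp[t - 1, w]
                advance = dp[t - 1, w - 1] if w > 0 else -np.inf
                back[t, w] = w if stay >= advance else w - 1
                dp[t, w] = max(stay, advance) + loglik[t, w]
        path, w = [W - 1], W - 1   # backtrace from the last word
        for t in range(T - 1, 0, -1):
            w = back[t, w]
            path.append(w)
        return path[::-1]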

Forced Alignment
Run forced alignment on each segment. For the 10 episodes tested, 90% of the segments aligned at the first step, yielding:
- Start time & end time of each word
- Speaker attribution

Forced Alignment
Pool the segments for each speaker and train speaker models, plus a garbage model:
- Trained on the audio that falls between the segments
- Assumed to contain only laughter, music, and other non-speech
A sketch of the garbage-model step follows.
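The garbage-model step could look like this sketch; the frame representation and the 20-component diagonal GMM mirror the earlier training sketch and are assumptions:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_garbage_model(mfcc_frames, speech_segments, frame_dur=0.01):
        """Fit a GMM on the audio between aligned speech segments.
        mfcc_frames: (n_frames, 19); speech_segments: (start_s, end_s) list."""
        mask = np.ones(len(mfcc_frames), dtype=bool)
        for start, end in speech_segments:
            mask[int(start / frame_dur):int(end / frame_dur)] = False
        gaps = mfcc_frames[mask]  # assumed laughter, music, other non-speech
        return GaussianMixture(n_components=20, covariance_type="diag").fit(gaps)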

Forced Alignment
For the failed single-speaker segments:
- Still use the segment start and end time
- Don't have a way to index the exact temporal location of each word
For each failed multi-speaker segment:
- Generate an HMM alternating speaker states and garbage states

Forced Alignment
- For each time step, advance an arc and collect the probability (e.g., moving across the Patrice arc invokes the Patrice speaker model at that time step)
- Segmentation = the most probable path through the HMM
- The garbage model allows for arbitrary noise between speakers
- A minimum duration is enforced for each speaker (in practice, the system was not sensitive to the duration)
A Viterbi sketch of this decoding follows.
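A Viterbi sketch with states alternating speaker / garbage; the minimum-duration constraint and the option to skip a garbage state are omitted for brevity, and the GMM interfaces follow the earlier sketches:

    import numpy as np

    def viterbi_resegment(frames, speaker_gmms, garbage_gmm):
        """speaker_gmms: models for the speakers the script lists for this
        segment, in order. Returns one state index per frame."""
        emitters = []
        for gmm in speaker_gmms:
            emitters += [gmm, garbage_gmm]   # speaker arc, then garbage arc
        emit = np.stack([g.score_samples(frames) for g in emitters], axis=1)
        T, S = emit.shape
        dp = np.full((T, S), -np.inf)
        back = np.zeros((T, S), dtype=int)
        dp[0, 0] = emit[0, 0]
        for t in range(1, T):
            for s in range(S):
                # stay in state s, or advance from state s-1
                prev = s - 1 if s > 0 and dp[t - 1, s - 1] > dp[t - 1, s] else s
                dp[t, s] = dp[t - 1, prev] + emit[t, s]
                back[t, s] = prev
        s = int(dp[-1].argmax())             # backtrace from best final state
        states = [s]
        for t in range(T - 1, 0, -1):
            s = back[t, s]
            states.append(s)
        return states[::-1]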

Forced Alignment
Multi-speaker segments => many single-speaker segments. Then run the forced alignment with ASR again.

Music & Laughter Segmentation
- Laughter is decoded using the SHoUT speech/non-speech decoder
- Music models are trained separately (same as in the original Joke-O-Mat)

Putting it all together: http://www.icsi.berkeley.edu/jokeomat/hd/auto/index.html

Evaluation
Compare to expert-annotated ground truth.
1. DER:
- False alarms: closed captions spanning multiple dialog segments
- Missed speech: truncation of words in forced alignment
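For reference, the diarization error rate combines false alarm, missed speech, and speaker error time over the scored reference speech; a trivial sketch:

    def der(false_alarm_s, missed_s, speaker_error_s, total_ref_speech_s):
        """Diarization error rate: total error time over reference speech time."""
        return (false_alarm_s + missed_s + speaker_error_s) / total_ref_speech_s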

Evaluation
Compare to expert-annotated ground truth.
2. User study:
- 25 participants
- Randomly shown expert- and fan-annotated episodes
- Asked to state a preference

Outline
1. Original Joke-O-Mat (2009): system setup, evaluation, limitations
2. Enhanced version (2010): system setup, evaluation
3. Future work

Limitations & Future Work
- Laughter and scene-transition music are manually trained
- Requires scripts and closed captions (available from show producers)
- Failed single-speaker segments: how to handle them? Retrain speaker models? An HMM for the whole episode?
- Look at other genres (dramas, soap operas, lectures?) with new rules
- Add visual data

Thanks!