Narrative Theme Navigation for Sitcoms Supported by Fan-generated Scripts

1 Narrative Theme Navigation for Sitcoms Supported by Fan-generated Scripts
Gerald Friedland, Luke Gottlieb, Adam Janin
International Computer Science Institute (ICSI)
Presented by: Katya Gonina

2 What?
- A novel method to generate indexing information for the navigation of TV content

3 Why?
- Lots of different ways to watch videos
  - DVD, Blu-ray
  - On-demand
  - Internet
- Lots of videos out there!
- Need better ways to navigate content
  - Show a particular scene
  - Show where a favorite actor talks
  - Support random seek into videos

4 Example: Sitcoms
- Specifically Seinfeld
- Strict set of rules
  - Every scene transition is marked by music
  - Every punchline is marked by artificial laughter
- Video:

5 Outline
1. Original Joke-O-Mat (2009)
   - System setup
   - Evaluation
   - Limitations
2. Enhanced version (2010)
   - System setup
   - Evaluation
3. Future Work

7 Joke-O-Mat
Original system (2008-2009)
- Ability to navigate basic narrative elements:
  - Scenes
  - Punchlines
  - Dialog segments
- Per-actor filter
- Ability to skip certain parts
- Surf the episode
"Using Artistic Markers and Speaker Identification for Narrative-Theme Navigation of Seinfeld Episodes," G. Friedland, L. Gottlieb, and A. Janin, Proceedings of the 11th IEEE International Symposium on Multimedia (ISM2009), San Diego, California, pp

8 Joke-O-Mat
Two main elements:
1. Pre-processing step
2. Online video browser:

9 Joke-O-Mat
Two main elements:
1. Pre-processing and analysis step

10 Acoustic Event & Speaker Identification
- Goal: train GMMs for different audio events
  - Jerry, Kramer, Elaine, George
  - Male & female supporting actor
  - Laughter
  - Music
  - Non-speech (i.e., other noises)
- Use 1-minute audio sample
- Compute 19-dim MFCCs
- Train 20-component GMMs

11 Audio Segmentation
- Given the trained GMMs
- 2.5 sec window / 10 ms per frame = 250 frames
- Compute the likelihood of each frame's features under each GMM
- Use a majority vote over the window to classify it as one of the speakers or as laughter/music/non-speech
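The windowed majority vote described on this slide can be sketched roughly as follows. This is a minimal illustration, not the Joke-O-Mat code: the diagonal-covariance GMM form, the function names, and the toy 2-dimensional, single-component models are all assumptions made for brevity (the slides use 19-dim MFCCs and 20-component GMMs).

```python
import math
from collections import Counter

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, gmm):
    """Log-likelihood of x under a GMM given as (weight, mean, var) triples."""
    logs = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))  # log-sum-exp

def classify_window(frames, models):
    """Label each frame with its best-scoring model, then take a majority vote."""
    votes = Counter(
        max(models, key=lambda name: gmm_log_likelihood(x, models[name]))
        for x in frames
    )
    return votes.most_common(1)[0][0]

WINDOW_SEC, FRAME_STEP_SEC = 2.5, 0.010
FRAMES_PER_WINDOW = round(WINDOW_SEC / FRAME_STEP_SEC)

# Hypothetical toy models: 2-dim features, one component each.
models = {
    "Jerry": [(1.0, [0.0, 0.0], [1.0, 1.0])],
    "Laughter": [(1.0, [5.0, 5.0], [1.0, 1.0])],
}
# One window in which 200 of the 250 frames look like laughter.
frames = [[4.8, 5.1]] * 200 + [[0.1, -0.2]] * 50
label = classify_window(frames, models)
```

Voting over a whole window rather than trusting single frames smooths out isolated misclassifications, at the cost of a coarser time resolution.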

12 Narrative Theme Analysis
- Transforms acoustic event segmentation and speaker detection into narrative theme segments
- Rule-based system:
  - Dialog = single contiguous speech segment
  - Punchline = dialog + laughter
  - Top-5 punchlines = 5 punchlines followed by the longest laughter
  - Scene = segment of at least 10 sec between two music events
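The four rules above could be applied to a labeled segment list along these lines. This is a sketch under assumptions: the `(label, start_sec, end_sec)` tuple format, the label strings, and the function name `analyze` are hypothetical, and "dialog + laughter" is read as a speech segment directly followed by a laughter segment.

```python
NON_SPEECH = {"laughter", "music", "non-speech"}

def analyze(segments, min_scene_sec=10.0, top_n=5):
    """segments: time-ordered (label, start_sec, end_sec) tuples."""
    # Dialog: any single contiguous speech segment.
    dialogs = [s for s in segments if s[0] not in NON_SPEECH]
    # Punchline: a dialog segment directly followed by laughter;
    # keep the laughter duration for ranking.
    punchlines = [
        (cur, nxt[2] - nxt[1])
        for cur, nxt in zip(segments, segments[1:])
        if cur[0] not in NON_SPEECH and nxt[0] == "laughter"
    ]
    # Top-N punchlines: those followed by the longest laughter.
    ranked = sorted(punchlines, key=lambda p: p[1], reverse=True)[:top_n]
    # Scene: the span between consecutive music events, if long enough.
    music = [s for s in segments if s[0] == "music"]
    scenes = [
        (a[2], b[1])
        for a, b in zip(music, music[1:])
        if b[1] - a[2] >= min_scene_sec
    ]
    return dialogs, [p[0] for p in punchlines], [p[0] for p in ranked], scenes

segments = [
    ("music", 0, 3), ("Jerry", 3, 8), ("laughter", 8, 10),
    ("George", 10, 14), ("Elaine", 14, 18), ("laughter", 18, 22),
    ("music", 22, 25), ("Kramer", 25, 29), ("music", 29, 31),
]
dialogs, punchlines, top, scenes = analyze(segments, top_n=1)
```

In this toy episode the 4-second stretch between the last two music events is too short to count as a scene, so only the span from 3 s to 22 s is reported.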

13 Narrative Theme Analysis
- Creates icons for the GUI
- Sitcom rule: an actor has to be shown once a certain speaking time is exceeded
  - Use the median frame of the longest speech segment for each actor
  - Could use a visual approach here
- Use the median frame for other events (scene, punchlines, dialog)
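A rough sketch of the icon-selection heuristic, assuming that "median frame" means the temporally middle frame of a segment and that dialog segments arrive as hypothetical `(actor, start_sec, end_sec)` tuples:

```python
def icon_timestamps(dialogs):
    """Map each actor to the midpoint time of their longest speech segment."""
    longest = {}
    for actor, start, end in dialogs:
        best = longest.get(actor)
        if best is None or end - start > best[1] - best[0]:
            longest[actor] = (start, end)
    # The frame grabbed at this timestamp would serve as the actor's icon.
    return {actor: (s + e) / 2 for actor, (s, e) in longest.items()}

icons = icon_timestamps([
    ("Jerry", 3.0, 8.0), ("Jerry", 40.0, 52.0), ("Elaine", 14.0, 18.0),
])
```

The heuristic relies on the sitcom convention stated on the slide: during a long enough speech segment, the speaking actor is expected to be on screen.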

14 Online Video Browser
- Shows video
- Allows for play/pause, seeking to random positions
- Navigational panel allows browsing directly to:
  - Scene
  - Punchline
  - Top-5 punchlines
  - Dialog element
- Select/deselect actors
index.html

15 Evaluation
Phase                     Performance     For 25min Episode
Training                  30% real-time   2.7min
Classification            10% real-time   2.5min
Narrative Theme Analysis  10% real-time   2.5min
Total                                     7.7min
- Diarization Error Rate (DER) = 46% (5% per class)
- Winner of the ACM Multimedia Grand Challenge 2009
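The timing figures on this slide are consistent with processing time = real-time factor × audio duration. The check below is only a plausibility sketch: the count of nine models (four main characters, two supporting-actor models, laughter, music, non-speech) is read off the earlier slide, and the idea that training time scales with nine 1-minute samples is an inference, not something the slides state.

```python
EPISODE_MIN = 25.0
N_MODELS, SAMPLE_MIN = 9, 1.0  # assumption: one 1-minute sample per model

training = 0.30 * N_MODELS * SAMPLE_MIN   # about 2.7 min
classification = 0.10 * EPISODE_MIN       # 2.5 min
theme_analysis = 0.10 * EPISODE_MIN       # 2.5 min
total = training + classification + theme_analysis  # about 7.7 min
```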

16 Limitations of the original Joke-O-Mat
- Requires manual training of speaker models
- Requires 60 seconds of training data for each speaker
- Cannot support actors with minor roles
- Does not take into account what was said

18 Extended System
Enhanced Joke-O-Mat (2010)
- + Speech recognition
  - Keyword search
- Automatic alignment of speaker ID and ASR with:
  - Fan-generated scripts
  - Closed captions
- Significantly reduces manual intervention

19 New Joke-O-Mat System

20 New Joke-O-Mat System

21 Context-Augmentation
- Producing transcripts can be costly
- Luckily we have the Internet!
- Scripts and closed captions produced by fans

22 Fan-generated data
- Fan-sourced scripts
  - Tend to be very accurate
  - However, don't contain any time information
- Closed captions
  - Contain time information
  - However, do not contain speaker attribution
  - Less accurate, often intentionally altered
- Normalize and merge them together

23 Fan-generated data
- Normalize the scripts and the closed captions
- Then, use minimum edit distance to align the two sources
- Start & end words in script = start & end words in caption
- Use timing from the closed caption, speaker from the script
  - If one speaker: single-speaker segment
  - If multiple speakers: multi-speaker segment (37.3%)
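A word-level minimum-edit-distance alignment of the kind described above might look like this. It is a minimal sketch: the normalization (lowercasing, dropping punctuation), the unit edit costs, and the names `normalize` and `align` are assumptions, since the slides do not specify them.

```python
import re

def normalize(text):
    """Lowercase and split into word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())

def align(script, caption):
    """Return (edit distance, aligned (script_idx, caption_idx) pairs)."""
    n, m = len(script), len(caption)
    # dist[i][j] = edit distance between script[:i] and caption[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (script[i - 1] != caption[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    # Backtrace from the bottom-right corner, keeping match/substitution pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if dist[i][j] == dist[i - 1][j - 1] + (script[i - 1] != caption[j - 1]):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j] + 1:
            i -= 1  # script word with no caption counterpart
        else:
            j -= 1  # caption word with no script counterpart
    return dist[n][m], pairs[::-1]

script_words = normalize("So, what's the deal with airplane food?")
caption_words = normalize("So what is the deal with airline food")
distance, pairs = align(script_words, caption_words)
```

Once the first and last script words of a speaker turn are paired with caption words, the caption's timestamps can be carried over to the script's speaker label, which is the merge step the slide describes.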

24 Forced Alignment
Audio + Transcript = Alignment
- Generate detailed timing information for each word
- Perform all steps of a speech recognizer on the audio
  - But, instead of using a language model, use only the transcript's sequence of words
- Also does speaker adaptation over segments
  - Will be more accurate on speaker-homogeneous segments

25 Forced Alignment
- Run forced alignment on each segment
- For the 10 episodes tested, 90% of the segments aligned at the first step
- Start time & end time of each word
- Speaker attribution

26 Forced Alignment
- Pool segments for each speaker and train speaker models
- Also train a garbage model
  - On audio that falls between the segments
  - Assumed to contain only laughter, music, and other non-speech

27 Forced Alignment
- For the failed single-speaker segments:
  - Still use segment start and end time
  - Don't have a way to index the exact temporal location of each word
- For each failed multi-speaker segment:
  - Generate an HMM alternating:
    - Speaker states
    - Garbage states

28 Forced Alignment
- For each time step, advance an arc and collect probability
  - Ex: if moving across the Patrice arc, invoke the Patrice speaker model at that time step
- Segmentation = most probable path through the HMM
- Garbage model allows for arbitrary noise between speakers
- Minimum duration for each speaker
  - In reality, the system was not sensitive to the duration
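Finding the most probable path through an alternating speaker/garbage HMM is a Viterbi search. The sketch below illustrates the idea under several assumptions: per-frame log-likelihoods are already computed for each model, transitions carry no cost, garbage states may be skipped between speakers, and no minimum duration is enforced. The function name and the score format are hypothetical.

```python
import math

def viterbi_segment(frame_scores, speakers, garbage="garbage"):
    """frame_scores: per-frame dicts of model name -> log-likelihood.
    speakers: non-empty speaker order taken from the script.
    Returns one model label per frame."""
    # Left-to-right chain: garbage, spk1, garbage, spk2, ..., garbage.
    chain = [garbage]
    for spk in speakers:
        chain += [spk, garbage]
    n_states, n_frames = len(chain), len(frame_scores)
    score = [[-math.inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    for s in (0, 1):  # start in the leading garbage or the first speaker
        score[0][s] = frame_scores[0][chain[s]]
    for t in range(1, n_frames):
        for s in range(n_states):
            # Stay, advance one state, or skip the garbage between speakers.
            preds = [s, s - 1] + ([s - 2] if chain[s] != garbage else [])
            best = max((p for p in preds if p >= 0),
                       key=lambda p: score[t - 1][p])
            score[t][s] = score[t - 1][best] + frame_scores[t][chain[s]]
            back[t][s] = best
    # End in the last speaker or the trailing garbage state.
    s = max((n_states - 2, n_states - 1), key=lambda k: score[-1][k])
    path = []
    for t in range(n_frames - 1, -1, -1):
        path.append(chain[s])
        s = back[t][s]
    return path[::-1]

def scores(patrice, jerry, garbage):
    return {"Patrice": patrice, "Jerry": jerry, "garbage": garbage}

# Toy scores: three Patrice-like frames, one noise frame, two Jerry-like frames.
frames = [scores(-1, -5, -4)] * 3 + [scores(-6, -6, -1)] + [scores(-5, -1, -4)] * 2
labels = viterbi_segment(frames, ["Patrice", "Jerry"])
```

Because the chain fixes the speaker order from the script, the search only has to decide where each turn starts and ends, which is what turns one multi-speaker segment into several single-speaker ones.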

29 Forced Alignment
- Multi-speaker segments => many single-speaker segments
- Run the forced alignment with ASR again

30 Music & Laughter Segmentation
- Laughter decoded using the Shout speech/nonspeech decoder
- Music models are trained separately (same as the original Joke-O-Mat)

31 Putting it all together index.html

32 Evaluation
Compare to expert-annotated ground truth
1. DER
   - False alarms: closed captions spanning multiple dialog segments
   - Missed speech: truncation of words in forced alignment
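For reference, the diarization error rate is conventionally the sum of false-alarm, missed-speech, and speaker-confusion time divided by the total reference speech time. A frame-level sketch follows; it assumes both label sequences use the same frame rate, that `None` marks non-speech, and that hypothesis speaker names already match the reference names (a full scorer would first find the best speaker mapping). The function name is hypothetical.

```python
def diarization_error_rate(reference, hypothesis):
    """Frame-level DER over two equal-length label sequences."""
    if len(reference) != len(hypothesis):
        raise ValueError("label sequences must have the same length")
    false_alarm = missed = confusion = ref_speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:
            ref_speech += 1
        if ref is None and hyp is not None:
            false_alarm += 1   # speech reported where there is none
        elif ref is not None and hyp is None:
            missed += 1        # speech that was not detected
        elif ref is not None and ref != hyp:
            confusion += 1     # speech attributed to the wrong speaker
    return (false_alarm + missed + confusion) / ref_speech

reference = ["Jerry"] * 4 + [None] * 2 + ["Elaine"] * 4
hypothesis = ["Jerry"] * 3 + [None, "Elaine", None] + ["Elaine"] * 2 + ["Jerry"] * 2
der = diarization_error_rate(reference, hypothesis)
```

Here one missed frame, one false-alarm frame, and two confused frames over eight reference speech frames give a DER of 0.5.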

33 Evaluation
Compare to expert-annotated ground truth
2. User Study
   - 25 participants
   - Randomly showed expert- and fan-annotated episodes
   - Asked to state preference

35 Limitations & Future Work
- Laughter and scene-transition music are manually trained
- Requires scripts and closed captions
  - Available from show producers
- Failed single-speaker segments: how to handle?
  - Retrain speaker models
  - HMM for the whole episode
- Look at other genres (dramas, soap operas, lectures?)
  - New rules
- Add visual data

36 Thanks!

Narrative Theme Navigation for Sitcoms Supported by Fan-generated Scripts Gerald Friedland International Computer Science Institute 1947 Center Street, Suite 600 Berkeley, CA 94704-1198 fractor@icsi.berkeley.edu