Multimodal databases at KTH


Multimodal databases at KTH
David House, Jens Edlund & Jonas Beskow
Clarin Workshop

The QSMT database (2002): Facial & articulatory motion

Purpose:
- Obtain coherent data for modelling and animation of the face and vocal tract in a combined model
- Explore the correlation between vocal tract and face motion
- Predict tongue motion from the face, and vice versa

The QSMT database (2002): Contents
- Single speaker
- 270 short Swedish sentences (7-9 syllables)
- 138 VCV and VCC{C}V words: 22 consonants, 24 consonant clusters, 3 carrier vowel contexts
- 41 C1VC2 words: 15 vowels, asymmetric consonant contexts
- Two sessions: with and without EMA

The QSMT database (2002): Setup
- Optical motion tracking: 4-camera Qualisys system (60 Hz), 3D positions of reflective markers
- EMA: MoveTrack system, recording 2D midsagittal coil positions
- Audio
- Video (DV)
- Sync signal

The QSMT database (2002): Merging of data sources
- EMA down-sampled to 60 Hz
- Temporally synchronized with the optical data
- Spatial alignment via one co-registered marker
- EMA (2D) inserted into 3D space at the midsagittal plane
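The merging step above could be sketched as follows. This is a minimal illustration, not the pipeline actually used: the function names are hypothetical, downsampling is plain linear interpolation, and the midsagittal plane is assumed to be given as an origin plus two unit axes (in practice derived from the co-registered marker).

```python
import numpy as np

def downsample_to_rate(t_src, x_src, rate_hz, t_end):
    """Resample one EMA channel onto a uniform time grid (e.g. 60 Hz)
    by linear interpolation against its original timestamps."""
    t_new = np.arange(0.0, t_end, 1.0 / rate_hz)
    return t_new, np.interp(t_new, t_src, x_src)

def embed_midsagittal(points_2d, origin, x_axis, y_axis):
    """Place 2D EMA coil positions into 3D at the midsagittal plane,
    defined by an origin point and two orthonormal in-plane axes."""
    points_2d = np.asarray(points_2d, dtype=float)
    return origin + points_2d[:, :1] * x_axis + points_2d[:, 1:2] * y_axis
```

With the plane spanned by the world x- and z-axes, a 2D coil position (2, 3) lands at (2, 0, 3) in 3D.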

The QSMT database (2002): Re-synthesis examples
- /A P A/
- /A T A/
- /A L A/
- "dom flyttade möblerna"

PF-STAR database (2005): Acted expressive speech

Purpose:
- Data for talking-head modelling
- Synthesis of expressive visual speech
- Studies of non-verbal facial motion (e.g. on focused words)

PF-STAR database (2005): Contents
- Single speaker (Swedish male amateur actor)
- Expressive sentences: 75 sentences x 5 acted emotions
- Focus sentences: 3-word sentences, read 3 times with focus on each word, x 7 expressive modes
- Short semi-scripted dialogues

PF-STAR database (2005): Setup
- 3D motion capture: Qualisys MacReflex, 4 IR cameras, 60 Hz capture rate, sub-millimeter accuracy
- 29 reflective markers: 4 for skull reference, 25 for face deformation (articulation + expression)
- Audio + video recording

PF-STAR database (2005): Data processing
- Tracking, gap-filling & checking: ~70 of each 75-sentence set were usable
- Normalisation for global head movements
- Calculation of MPEG-4 FAPs, a (sub-)set of 38 low-level face parameters: jaw (4), lips (22), cheeks (4), eyebrows (12)
- Verification through re-animation
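The normalisation for global head movements could, in principle, use the four skull-reference markers to estimate a rigid transform per frame and undo it on the face markers. The sketch below uses the standard Kabsch (SVD) algorithm for this; it is an illustration under that assumption, not the exact procedure used in the project.

```python
import numpy as np

def rigid_align(ref, cur):
    """Kabsch: find rotation R and translation t such that
    R @ cur_i + t maps the current skull markers onto the reference
    pose. ref, cur: (N, 3) arrays of corresponding marker positions."""
    ref_c, cur_c = ref.mean(axis=0), cur.mean(axis=0)
    H = (cur - cur_c).T @ (ref - ref_c)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = ref_c - R @ cur_c
    return R, t

def normalise_frame(face_markers, R, t):
    """Remove global head motion from (M, 3) face markers using the
    transform estimated from the skull-reference markers."""
    return face_markers @ R.T + t
```

After this step, the remaining marker motion reflects articulation and expression only, which is what the FAP calculation needs.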

PF-STAR database (2005): Re-synthesis
- Travel agent dialogue
- Expressive visual speech synthesis: angry, happy, sad

Swedish Multimodal Database
Research project: Multimodal database of spontaneous speech in dialog, 2007-2010, funded by the Swedish Research Council (KFI grant for large databases)

Research Program
- Both vocal signals and facial and body gestures are important for communicative interaction
- Signals for turn-taking, feedback giving or seeking, and emotions and attitudes can be both vocal and visual
- Our understanding of vocal and visual cues and their interaction in spontaneous speech is growing, but there is a great need for data with which we can make more precise measurements
- A large Swedish multimodal database will enable researchers to test hypotheses covering a variety of functions of visual and verbal behavior in dialog
- Freely available for research

Project Goals
- A Swedish multimodal spontaneous-speech database, rich enough to capture speaker and speaking-style variation
- High-quality audio and video recordings (HD)
- Motion capture for body and head movements for all recordings
- 5% of the recordings to include motion capture for facial and head gestures

Swedish Multimodal Database: design
- 8 configurations: {female-female, female-male, male-female, male-male} x {friends, strangers}, 15 dialogues each
- At least eight dialogues with motion capture for gesture and facial movements (one per configuration)
- Motion capture for body gestures for nearly the entire database
- Each dialogue: 20 minutes free dialog, 10 minutes discussion of an artifact
- Total database: 120 dialogues x 30 minutes each = 60 hours


Best practices
- 60 (70+) hours
- 4 + 2 audio channels
- 2 video channels
- 24 + 4 mocap markers
- 2+ TB of data (15+ GB per recording)
- Automate! Simpler, consistent, repeatable; the method used becomes the standard

Synchronisation
Online synchronisation is complex:
- 4+ channels of audio: sync straightforward (analogue sound to one sound card), but exact frame rate unknown
- 2 video cameras: unsynced, internal hard drives, exact frame rate unknown
- 6 motion-capture cameras: individually in sync, over USB, exact frame rate unknown, with large variation in frame rate (66 Hz, 98-102 Hz)

Synchronisation II
Signals for off-line synchronisation:
- Events (start/end), all controlled by one switch:
  - a sine tone (goes into the commentary audio channel)
  - green diodes (can be found automatically in the video)
  - an IR diode (appears as a marker in the mocap)
- Stream: a turntable with a marker and a record:
  - a scratch in the record creates a click in a separate audio channel
  - the marker is captured by both mocap and video
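Finding the sine-tone event in the commentary channel could be done by measuring energy at the tone frequency in short windows (a single-bin DFT) and taking the first window where it exceeds a threshold. This is a hedged sketch of that idea; the function name, window length and threshold are illustrative, not the project's actual tooling.

```python
import numpy as np

def tone_onset(signal, fs, tone_hz, win=0.05, threshold=0.5):
    """Return the time (s) where energy at tone_hz first exceeds
    `threshold` times the peak windowed tone energy, else None."""
    n = int(win * fs)
    t = np.arange(n) / fs
    probe = np.exp(-2j * np.pi * tone_hz * t)  # single-bin DFT probe
    n_win = len(signal) // n
    energy = np.array([
        np.abs(np.dot(signal[i * n:(i + 1) * n], probe))
        for i in range(n_win)
    ])
    peak = energy.max()
    if peak <= 0:
        return None
    hits = np.nonzero(energy > threshold * peak)[0]
    return hits[0] * n / fs if hits.size else None
```

The same onset detector, run on each recording channel, gives the per-channel offsets needed to line the streams up.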

Video processing
- Automatic download and processing
- Merging of files
- Wrapping in a legible wrapper
- Production of work files: lo-res browse copy, stills, average images
- Average images used for annotation of pertinent areas (green light, face)
- Automatic detection of start and end points
- Automatic production of demo film and face close-ups
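The average images mentioned above exploit a simple property: features that stay fixed across a recording (such as the green sync light) remain sharp in the frame mean, while moving content blurs out, which makes the fixed regions easy to annotate. A minimal sketch, assuming frames arrive as equally sized NumPy arrays:

```python
import numpy as np

def average_image(frames):
    """Pixel-wise mean of a sequence of equally sized frames
    (H x W [x C] uint8 arrays). Accumulates in float to avoid
    uint8 overflow, then converts back."""
    acc = np.zeros_like(np.asarray(frames[0], dtype=np.float64))
    for f in frames:
        acc += np.asarray(f, dtype=np.float64)
    return (acc / len(frames)).astype(np.uint8)
```

Running this over a browse copy of each recording yields one still per session on which the green-light and face regions can be marked once.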

Mocap processing
- Detection of the turntable marker
- Time stamping based on the turntable
- Detection of the start/end marker
- Detection of start, end, etc.
- Marker identification and resorting
- Resampling into a constant frame rate
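Since the mocap cameras deliver a drifting frame rate (66 Hz, 98-102 Hz, per the synchronisation slide), the final resampling step maps each marker trajectory from its recorded timestamps onto a uniform grid. A sketch with per-coordinate linear interpolation; the function name and signature are illustrative:

```python
import numpy as np

def resample_constant_rate(t, positions, rate_hz):
    """Resample marker trajectories recorded at a drifting frame rate
    onto a uniform grid. t: (N,) timestamps in seconds; positions:
    (N, M) coordinate columns (e.g. x, y, z per marker)."""
    t = np.asarray(t, dtype=float)
    positions = np.asarray(positions, dtype=float)
    t_new = np.arange(t[0], t[-1], 1.0 / rate_hz)
    out = np.column_stack([
        np.interp(t_new, t, positions[:, j])
        for j in range(positions.shape[1])
    ])
    return t_new, out
```

Linear interpolation is adequate here because mocap marker motion is slow relative to even the lowest camera rate; gaps would still need the gap-filling pass first.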

Audio processing
- Start/end detection (blind source localization, filtering)
- Speech detection
- Creation of lo-res copy
- Orthographic transcription: words, events, speech detection errors
- Validation (automatic and manual)
- Forced alignment
- Validation, lexicon correction and realignment
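A crude first pass for the speech detection step is frame-wise RMS energy against a relative threshold, with the resulting speech/non-speech segments handed on to transcription and manual validation. This sketch is an assumption about the general approach, not the project's actual detector:

```python
import numpy as np

def speech_frames(signal, fs, frame_s=0.02, threshold_db=-30.0):
    """Flag 20 ms frames whose RMS energy exceeds `threshold_db`
    relative to the loudest frame. Returns a boolean mask."""
    n = int(frame_s * fs)
    n_frames = len(signal) // n
    frames = np.asarray(signal[: n_frames * n], dtype=float).reshape(n_frames, n)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    ref = rms.max()
    if ref <= 0:
        return np.zeros(n_frames, dtype=bool)
    db = 20.0 * np.log10(np.maximum(rms, 1e-12) / ref)
    return db > threshold_db
```

An energy detector of this kind also flags breaths, coughs and laughter, which is why the pipeline above includes validation and a dedicated "speech detection errors" tier in the transcription.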

Result
- Speech/non-speech: breath, coughs, laughter, etc.
- Places of interest
- Orthographic transcription
- Pronunciation lexicon of all words
- Phoneme strings with times
- Gesture tracks
- Video
- Guidelines through which the data can be recreated

Thank you for your attention

CLARIN has received funding from the European Community's Seventh Framework Programme under grant agreement n° 212230