Multimodal databases at KTH

Multimodal databases at KTH
David House, Jens Edlund & Jonas Beskow
Clarin Workshop

The QSMT database (2002): Facial & Articulatory motion
Purpose
  Obtain coherent data for modelling and animation of face and vocal tract in a combined model
  Explore correlation between vocal tract and face motion
  Predict tongue motion from face and vice versa

The QSMT database (2002)
Contents
  Single speaker
  270 short Swedish sentences
    7-9 syllables
  138 VCV and VCC{C}V words
    22 consonants, 24 consonant clusters
    3 carrier vowel contexts
  41 C1VC2 words
    15 vowels
    Asymmetric consonant contexts
  Two sessions, with and without EMA

The QSMT database (2002)
Setup
  Optical motion tracking
    4-camera Qualisys system (60 Hz)
    3D positions of reflective markers
  EMA
    MoveTrack system
    Records 2D midsagittal coil positions
  Audio
  Video (DV)
  Sync-signal

The QSMT database (2002)
Merging of data sources
  EMA down-sampled to 60 Hz
  Temporally synchronized with optical data
  Spatial alignment (one co-registered marker)
  EMA (2D) inserted into 3D space at midsagittal plane
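The spatial step of the merge can be sketched as follows. This is a minimal illustration, not the procedure actually used for QSMT: it assumes the optical frame has its x axis pointing left-right (so the midsagittal plane is a plane of constant x), that the EMA axes are already parallel to the optical y and z axes, and that the single co-registered marker supplies a pure translation. The function name `ema_to_3d` and its parameters are hypothetical.

```python
def ema_to_3d(ema_xy, ema_ref_xy, optical_ref_xyz):
    """Place 2D midsagittal EMA points in the 3D optical coordinate frame.

    ema_xy          -- list of (u, v) coil positions in the EMA plane
    ema_ref_xy      -- (u, v) of the co-registered marker as seen by EMA
    optical_ref_xyz -- (x, y, z) of the same marker as seen by the optical system

    Assumption (not from the slides): EMA u maps to optical y (front-back),
    EMA v maps to optical z (up-down), and no rotation or scaling is needed.
    """
    # Translation that moves the EMA reference onto the optical reference.
    dy = optical_ref_xyz[1] - ema_ref_xy[0]
    dz = optical_ref_xyz[2] - ema_ref_xy[1]
    # All EMA points share the x coordinate of the midsagittal plane.
    x_mid = optical_ref_xyz[0]
    return [(x_mid, u + dy, v + dz) for (u, v) in ema_xy]
```

With more than one shared marker, a rotation and scale could be estimated as well; with a single marker only the offset is recoverable, which is why the sketch stops at a translation.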

The QSMT database (2002)
Re-synthesis: /A P A/

The QSMT database (2002)
Re-synthesis: /A T A/

The QSMT database (2002)
Re-synthesis: /A L A/

The QSMT database (2002)
Re-synthesis: dom flyttade möblerna ("they moved the furniture")

PF-STAR database (2005): Acted Expressive Speech
Purpose
  Data for talking head modeling
  Synthesis of expressive visual speech
  Studies of non-verbal facial motion (e.g. on focused words)

PF-STAR database (2005)
Contents
  Single speaker (Swedish male amateur actor)
  Expressive sentences
    75 sentences x 5 acted emotions
  Focus sentences
    3-word sentences, read 3 times with focus on each word x 7 expressive modes
  Short semi-scripted dialogues

PF-STAR database (2005)
Setup
  3D motion capture
    Qualisys MacReflex
    4 IR-cameras
    60 Hz capture rate
    Sub-millimeter accuracy
  29 reflective markers
    4 for skull reference
    25 for face deformation (articulation + expression)
  Audio + video recording

PF-STAR database (2005)
Data processing
  Tracking, gap-filling & checking
    ~70 of each 75-sentence set were usable
  Normalisation for global head movements
  Calculation of MPEG-4 FAPs
    A (sub-)set of 38 low-level face parameters: Jaw (4), Lips (22), Cheeks (4), Eyebrows (12)
  Verification through re-animation
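One way to picture the normalisation for global head movements is to express every face marker in a coordinate frame attached to the skull reference markers, so that rigid head motion cancels out. The sketch below is an assumption about how this could be done, not the PF-STAR implementation: it builds the frame from three of the reference markers per capture frame, whereas a least-squares rigid fit over all four would be a common alternative. The names `head_frame` and `to_head_coords` are hypothetical.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _unit(a):
    n = math.sqrt(_dot(a, a))
    return (a[0] / n, a[1] / n, a[2] / n)

def head_frame(p0, p1, p2):
    """Orthonormal frame from three non-collinear skull markers.

    Origin is p0, e1 points from p0 to p1, e3 is normal to the plane of the
    three markers, and e2 completes the right-handed set.
    """
    e1 = _unit(_sub(p1, p0))
    e3 = _unit(_cross(e1, _sub(p2, p0)))
    e2 = _cross(e3, e1)
    return p0, (e1, e2, e3)

def to_head_coords(point, frame):
    """Coordinates of a face marker relative to the head frame."""
    origin, axes = frame
    d = _sub(point, origin)
    return tuple(_dot(d, e) for e in axes)
```

Applied frame by frame, a marker that only moves with the head keeps constant head-frame coordinates, so what remains in the data is face deformation (articulation and expression).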

PF-STAR database (2005)
Re-synthesis: travel agent dialogue

PF-STAR database (2005)
Expressive visual speech synthesis: Angry, Happy, Sad

Swedish Multimodal Database
Research Project: Multimodal database of spontaneous speech in dialog
2007-2010, funded by the Swedish Research Council, KFI - Grant for large databases

Research Program
  Both vocal signals and facial and body gestures are important for communicative interaction
  Signals for turn-taking, feedback giving or seeking, and emotions and attitudes can be both vocal and visual
  Our understanding of vocal and visual cues and interactions in spontaneous speech is growing, but there is a great need for data with which we can make more precise measurements
  A large Swedish multimodal database will enable researchers to test hypotheses covering a variety of functions of visual and verbal behavior in dialog
  Freely available for research

Project Goals
  Swedish multimodal spontaneous speech database
  Rich enough to capture speaker and speaking style variation
  High-quality audio and video recordings (HD)
  Motion capture for body and head movements for all recordings
  5% of the recordings to include motion capture for facial and head gestures

Swedish Multimodal Database
  Dialogue configurations: speaker pairing (female-female, female-male, male-female, male-male) x familiarity (friends, strangers)
  15 dialogues* per configuration
  At least eight dialogues with motion capture for gesture and facial movements (one per configuration)
  Motion capture for body gestures for nearly the entire database
  20 minutes free dialog, 10 minutes discussion of an artifact
  Total database = 120 dialogues, 30 minutes for each dialog = 60 hours


Best practices
  60 (70+) hours
  4 + 2 audio channels
  2 video channels
  24 + 4 mocap markers
  2+ TB of data (15+ GB per recording)
  Automate!
    Simpler
    Consistent
    Repeatable
    Method used standard
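The size figures quoted on the slides can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers from the slides; the count of eight configurations is inferred from "one per configuration" together with "at least eight dialogues", and 15 GB per recording is treated as a lower bound because the slide says "15+".

```python
# Figures from the slides. CONFIGURATIONS = 8 is an inference, see lead-in.
CONFIGURATIONS = 8
DIALOGUES_PER_CONFIGURATION = 15
MINUTES_PER_DIALOGUE = 30          # 20 min free dialog + 10 min artifact discussion
GB_PER_RECORDING = 15              # lower bound ("15+ GB per recording")

def corpus_size():
    """Return (number of dialogues, total hours, lower-bound storage in decimal TB)."""
    dialogues = CONFIGURATIONS * DIALOGUES_PER_CONFIGURATION
    hours = dialogues * MINUTES_PER_DIALOGUE / 60
    terabytes = dialogues * GB_PER_RECORDING / 1000
    return dialogues, hours, terabytes
```

This gives 120 dialogues and 60 hours, matching the stated totals, and a lower bound of about 1.8 TB, which is in line with the "2+ TB" figure once recordings exceed 15 GB.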

Synchronisation
  Online synchronisation is complex
  4+ channels of audio
    Sync straightforward
    Analogue sound to one sound card
    Exact frame rate unknown
  2 video cameras
    Unsynced
    Internal hard-drive
    Exact frame rate unknown
  6 motion capture cameras
    Individually in sync
    USB
    Exact frame rate unknown
    Large variation in frame rate: 66 Hz, 98-102 Hz

Synchronisation II
  Signals for off-line synchronisation
  Events (Start/End) - one switch controls
    Sine tone (goes into commentary audio channel)
    Green diodes (can be found automatically in video)
    IR diode (appears to be a marker in mocap)
  Stream - turn-table with marker and record
    Scratch in record creates click in separate audio channel
    Marker is captured by mocap and video
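Locating the start/end sine tone in the commentary channel could be done with a single-frequency detector such as the Goertzel algorithm. The sketch below is one possible approach, not the method used in the project: the tone frequency, sample rate, frame length and threshold are all assumptions, and `find_tone` is a hypothetical helper. A frame counts as tone when the power at the target frequency accounts for most of the frame's energy.

```python
import math

def goertzel_power(frame, sample_rate, freq):
    """Squared magnitude of the `freq` component of `frame` (Goertzel recurrence)."""
    k = 2.0 * math.cos(2.0 * math.pi * freq / sample_rate)
    s1 = s2 = 0.0
    for x in frame:
        s0 = x + k * s1 - s2
        s2, s1 = s1, s0
    return s1 * s1 + s2 * s2 - k * s1 * s2

def find_tone(samples, sample_rate, freq, frame_len=400, threshold=0.5):
    """Return (start_s, end_s) of the first run of tone frames, or None.

    For a pure sine with a whole number of cycles per frame, the ratio
    power / (energy * frame_len / 2) is 1; noise or speech gives a much
    smaller value. frame_len and threshold are illustrative choices.
    """
    start = end = None
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(x * x for x in frame)
        ratio = 0.0
        if energy > 0.0:
            ratio = goertzel_power(frame, sample_rate, freq) / (energy * frame_len / 2.0)
        if ratio > threshold:
            if start is None:
                start = i
            end = i + frame_len
        elif start is not None:
            break  # first tone burst has ended
    if start is None:
        return None
    return start / sample_rate, end / sample_rate
```

The time resolution is one frame (25 ms at 16 kHz with the defaults), so a finer search around the detected boundaries would be needed for frame-accurate alignment with video and mocap.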

Video processing
  Automatic download and processing
    Merging of files
    Wrapping in legible wrapper
  Production of work files
    lo-res browse copy
    Stills
    Average images
  Average images used for annotation of pertinent areas
    Green light
    Face
  Automatic detection of start and endpoints
  Automatic production of demo film and face closeups

Mocap processing
  Detection of turntable marker
  Time stamping based on turntable
  Detection of start-end marker
  Detection of start, end, etc.
  Marker identification and resorting
  Resampling into constant framerate
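Because the mocap frame rate is unknown and varies, the rotating turntable marker can act as a clock: its accumulated rotation angle divided by the turntable's angular speed gives a time stamp per frame, after which the other markers can be resampled to a constant rate. The sketch below illustrates this idea under stated assumptions: marker coordinates are given relative to the turntable centre, the turntable runs at a constant 33 1/3 rpm, and the marker turns less than half a revolution between frames. The function names and the `direction` parameter are hypothetical.

```python
import math

def turntable_timestamps(xy, rpm=100.0 / 3.0, direction=1):
    """Seconds since the first frame, from the turntable marker's rotation.

    xy        -- (x, y) of the marker per frame, relative to the turntable centre
    rpm       -- assumed turntable speed (33 1/3 rpm by default)
    direction -- +1 if the angle increases over time, -1 otherwise
    """
    omega = 2.0 * math.pi * rpm / 60.0  # rad/s
    times, total, prev = [], 0.0, None
    for x, y in xy:
        angle = math.atan2(y, x)
        if prev is not None:
            # Wrap the step into [-pi, pi) so full revolutions accumulate.
            step = (angle - prev + math.pi) % (2.0 * math.pi) - math.pi
            total += direction * step
        prev = angle
        times.append(total / omega)
    return times

def resample_constant(times, values, rate_hz):
    """Linearly interpolate one scalar channel onto a uniform time grid.

    Requires at least two samples and strictly increasing `times`.
    """
    out, j = [], 0
    n = int((times[-1] - times[0]) * rate_hz) + 1
    for k in range(n):
        t = times[0] + k / rate_hz
        while j + 1 < len(times) - 1 and times[j + 1] < t:
            j += 1
        w = (t - times[j]) / (times[j + 1] - times[j])
        out.append(values[j] + w * (values[j + 1] - values[j]))
    return out
```

In use, `resample_constant` would be called once per coordinate of each marker with the time stamps from `turntable_timestamps`; gaps in marker tracks would need to be handled before interpolation.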

Audio processing
  Start-end detection (blind source localization, filtering)
  Speech detection
  Creation of lo-res copy
  Orthographic transcription
    Words
    Events
    Speech detection errors
  Validation (automatic and manual)
  Forced alignment
  Validation, lexicon correction and realignment
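The slides do not say how speech detection was performed. As a rough illustration of the kind of speech/non-speech segmentation this step produces, the sketch below uses a plain energy threshold on short frames and merges segments separated by short pauses. The frame length, threshold and minimum gap are arbitrary assumptions, and `speech_segments` is a hypothetical name; a production detector would be considerably more robust.

```python
import math

def speech_segments(samples, sample_rate, frame_s=0.01, threshold=0.02, min_gap_s=0.2):
    """Return [(start_s, end_s), ...] for stretches whose frame RMS exceeds `threshold`.

    Active frames closer together than `min_gap_s` are merged into one segment.
    All parameter defaults are illustrative, not taken from the slides.
    """
    n = int(round(frame_s * sample_rate))
    segments = []
    for i in range(0, len(samples) - n + 1, n):
        frame = samples[i:i + n]
        rms = math.sqrt(sum(x * x for x in frame) / n)
        if rms <= threshold:
            continue
        start, end = i / sample_rate, (i + n) / sample_rate
        if segments and start - segments[-1][1] < min_gap_s:
            # Extend the previous segment across the short pause.
            segments[-1] = (segments[-1][0], end)
        else:
            segments.append((start, end))
    return segments
```

Segments produced this way would still need the validation pass listed on the slide, since breaths, coughs and cross-talk from the other speaker can all exceed a simple energy threshold.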

Result
  Speech/non-speech
  Breath, coughs, laughter, etc.
  Places of interest
  Orthographic transcription
  Pronunciation lexicon of all words
  Phoneme strings with times
  Gesture tracks
  Video
  Guidelines through which data can be recreated

Thank you for your attention
CLARIN has received funding from the European Community's Seventh Framework Programme under grant agreement no. 212230