Multimodal databases at David House, Jens Edlund & Jonas Beskow Clarin Workshop
The QSMT database (2002): Facial & Articulatory motion
Purpose
- Obtain coherent data for modelling and animation of the face and vocal tract in a combined model
- Explore the correlation between vocal tract and face motion
- Predict tongue motion from face motion and vice versa
The QSMT database (2002)
Contents
- Single speaker
- 270 short Swedish sentences
  - 7-9 syllables
- 138 VCV and VCC{C}V words
  - 22 consonants, 24 consonant clusters
  - 3 carrier vowel contexts
- 41 C1VC2 words
  - 15 vowels
  - Asymmetric consonant contexts
- Two sessions, with and without EMA
The QSMT database (2002)
Setup
- Optical motion tracking
  - 4-camera Qualisys system (60 Hz)
  - 3D positions of reflective markers
- EMA
  - MoveTrack system
  - Records 2D midsagittal coil positions
- Audio
- Video (DV)
- Sync signal
The QSMT database (2002)
Merging of data sources
- EMA down-sampled to 60 Hz
- Temporally synchronized with optical data
- Spatial alignment (one co-registered marker)
- EMA (2D) inserted into 3D space at the midsagittal plane
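The spatial alignment step can be illustrated with a minimal sketch. It assumes that the two coordinate systems already share orientation and scale, so that a single co-registered marker determines only a translation, and that the optical x and y axes span the midsagittal plane while z is the lateral axis. The function name and the axis convention are hypothetical and are not taken from the slides.

```python
def embed_midsagittal(ema_xy, ema_ref_xy, optical_ref_xyz):
    """Place 2D EMA coil positions in the optical 3D frame.

    ema_xy          -- list of (x, y) coil positions in the EMA frame
    ema_ref_xy      -- the co-registered marker as seen by EMA
    optical_ref_xyz -- the same marker as seen by the optical system

    Assumption: translation only (no rotation or scaling), with the
    optical z axis taken as the lateral direction.
    """
    # Offset that moves the EMA reference onto the optical reference.
    dx = optical_ref_xyz[0] - ema_ref_xy[0]
    dy = optical_ref_xyz[1] - ema_ref_xy[1]
    # The co-registered marker is assumed to lie in the midsagittal
    # plane, so its lateral coordinate defines that plane.
    z_plane = optical_ref_xyz[2]
    return [(x + dx, y + dy, z_plane) for x, y in ema_xy]


coils_3d = embed_midsagittal([(1.0, 2.0), (3.0, 4.0)], (0.0, 0.0), (10.0, 20.0, 5.0))
```

With real data a rotation between the two frames would probably also have to be estimated, which one marker alone cannot provide.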
The QSMT database (2002): Re-synthesis: /A P A/
The QSMT database (2002): Re-synthesis: /A T A/
The QSMT database (2002): Re-synthesis: /A L A/
The QSMT database (2002): Re-synthesis: "dom flyttade möblerna" ("they moved the furniture")
PF-STAR database (2005): Acted Expressive Speech
Purpose
- Data for talking head modeling
- Synthesis of expressive visual speech
- Studies of non-verbal facial motion (e.g. on focused words)
PF-STAR database (2005)
Contents
- Single speaker (Swedish male amateur actor)
- Expressive sentences
  - 75 sentences x 5 acted emotions
- Focus sentences
  - 3-word sentences, read 3 times with focus on each word x 7 expressive modes
- Short semi-scripted dialogues
PF-STAR database (2005)
Setup
- 3D motion capture
  - Qualisys MacReflex
  - 4 IR cameras
  - 60 Hz capture rate
  - Sub-millimeter accuracy
- 29 reflective markers
  - 4 for skull reference
  - 25 for face deformation (articulation + expression)
- Audio + video recording
PF-STAR database (2005)
Data processing
- Tracking, gap-filling & checking
  - ~70 of each 75-sentence set were usable
- Normalisation for global head movements
- Calculation of MPEG-4 FAPs
  - A (sub-)set of 38 low-level face parameters: Jaw (4), Lips (22), Cheeks (4), Eyebrows (12)
- Verification through re-animation
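One common way to remove global head movement is to express every face marker in a coordinate frame that is rigidly attached to the skull reference markers. The sketch below builds such a frame from three reference markers. Using three of the four skull markers, and all function names, are assumptions made for illustration; the slides do not say which normalisation method was actually used.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _unit(a):
    n = math.sqrt(_dot(a, a))
    return tuple(x / n for x in a)


def head_frame(r0, r1, r2):
    """Build an orthonormal frame from three non-collinear skull markers."""
    x_axis = _unit(_sub(r1, r0))
    # Normal of the plane through the three reference markers.
    z_axis = _unit(_cross(x_axis, _sub(r2, r0)))
    # Completes a right-handed frame; already unit length.
    y_axis = _cross(z_axis, x_axis)
    return r0, (x_axis, y_axis, z_axis)


def to_head_coords(point, frame):
    """Express a face marker in head-local coordinates."""
    origin, axes = frame
    d = _sub(point, origin)
    return tuple(_dot(d, axis) for axis in axes)
```

Because the frame moves with the skull, a face marker that only follows the head keeps constant local coordinates from frame to frame, and any remaining change reflects articulation or expression.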
PF-STAR database (2005): Re-synthesis: travel agent dialogue
PF-STAR database (2005): Expressive visual speech synthesis: Angry, Happy, Sad
Swedish Multimodal Database
Research project: Multimodal database of spontaneous speech in dialog
2007-2010, funded by the Swedish Research Council, KFI - Grant for large databases
Research Program
- Both vocal signals and facial and body gestures are important for communicative interaction
- Signals for turn-taking, feedback giving or seeking, and emotions and attitudes can be both vocal and visual
- Our understanding of vocal and visual cues and interactions in spontaneous speech is growing, but there is a great need for data with which we can make more precise measurements
- A large Swedish multimodal database will enable researchers to test hypotheses covering a variety of functions of visual and verbal behavior in dialog
- Freely available for research
Project Goals
- Swedish multimodal spontaneous speech database
- Rich enough to capture speaker and speaking style variation
- High-quality audio and video recordings (HD)
- Motion capture for body and head movements for all recordings
- 5% of the recordings to include motion capture for facial and head gestures
Swedish Multimodal Database
- Speaker pairings: female-female, female-male, male-female, male-male
- Relationship: friends, strangers
- 15 dialogues* per configuration (pairing x relationship)
- * At least eight dialogues with motion capture for gesture and facial movements (one per configuration)
- Motion capture for body gestures for nearly the entire database
- 20 minutes free dialog, 10 minutes discussion of an artifact
- Total database = 120 dialogues, 30 minutes for each dialog = 60 hours
Best practices
- 60 (70+) hours
- 4 + 2 audio channels
- 2 video channels
- 24 + 4 mocap markers
- 2+ TB of data (15+ GB per recording)
- Automate!
  - Simpler
  - Consistent
  - Repeatable
  - Method used standard
Synchronisation
Online synchronisation is complex
- 4+ channels of audio
  - Sync straightforward
  - Analogue sound to one sound card
  - Exact frame rate unknown
- 2 video cameras
  - Unsynced
  - Internal hard drive
  - Exact frame rate unknown
- 6 motion capture cameras
  - Individually in sync
  - USB
  - Exact frame rate unknown
  - Large variation in frame rate: 66 Hz, 98-102 Hz
Synchronisation II
Signals for off-line synchronisation
- Events (Start/End): one switch controls
  - Sine tone (goes into commentary audio channel)
  - Green diodes (can be found automatically in video)
  - IR diode (appears to be a marker in mocap)
- Stream: turntable with marker and record
  - Scratch in record creates click in separate audio channel
  - Marker is captured by mocap and video
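Once the start and end events have been located in two streams, a stream with an unknown but roughly constant frame rate can be mapped onto a reference timeline. The sketch below assumes that the audio channel carrying the sine tone serves as the reference clock and that the frame rate is constant between the two events. Both are assumptions; the second one does not hold for a stream with a strongly varying rate, which is presumably why the turntable stream signal exists. The function names are hypothetical.

```python
def effective_frame_rate(start_frame, end_frame, start_time_s, end_time_s):
    """Frames per second implied by two events seen in both streams.

    start_frame, end_frame     -- frame indices of the events in the stream
    start_time_s, end_time_s   -- times of the same events on the reference clock
    """
    return (end_frame - start_frame) / (end_time_s - start_time_s)


def frame_to_time(frame, start_frame, start_time_s, rate_hz):
    """Map a frame index to reference time, assuming a constant rate."""
    return start_time_s + (frame - start_frame) / rate_hz


# Example: events at video frames 100 and 3100, heard at 2.0 s and 122.0 s.
rate = effective_frame_rate(100, 3100, 2.0, 122.0)
midpoint_time = frame_to_time(1600, 100, 2.0, rate)
```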
Video processing
- Automatic download and processing
  - Merging of files
  - Wrapping in legible wrapper
- Production of work files
  - Lo-res browse copy
  - Stills
  - Average images
- Average images used for annotation of pertinent areas
  - Green light
  - Face
- Automatic detection of start and end points
- Automatic production of demo film and face close-ups
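Automatic detection of the start and end points in video could look roughly like the sketch below, which scans an annotated region for frames where the green light switches on. The frame representation (nested lists of RGB tuples), the green-dominance score and the threshold are all assumptions for illustration, not the project's actual detector.

```python
def green_score(frame, roi):
    """Mean green dominance inside a region of interest.

    frame -- list of rows, each a list of (r, g, b) tuples
    roi   -- (top, left, bottom, right), bottom and right exclusive
    """
    top, left, bottom, right = roi
    total = 0.0
    count = 0
    for row in frame[top:bottom]:
        for r, g, b in row[left:right]:
            # Green minus the mean of red and blue: high for a lit green
            # diode, near zero for grey or white pixels.
            total += g - (r + b) / 2.0
            count += 1
    return total / count


def diode_onsets(frames, roi, threshold):
    """Indices of frames where the diode goes from off to on."""
    onsets = []
    was_on = False
    for index, frame in enumerate(frames):
        is_on = green_score(frame, roi) > threshold
        if is_on and not was_on:
            onsets.append(index)
        was_on = is_on
    return onsets
```

The region of interest would come from the annotation of the averaged image, so that the detector only looks where the green light is known to be.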
Mocap processing
- Detection of turntable marker
- Time stamping based on turntable
- Detection of start-end marker
- Detection of start, end, etc.
- Marker identification and resorting
- Resampling into constant frame rate
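The final resampling step can be sketched as linear interpolation from irregularly time-stamped frames onto a uniform time grid. Linear interpolation and the function name are assumptions; the slides do not state which interpolation was used.

```python
def resample_constant_rate(timestamps, positions, rate_hz):
    """Resample one marker track onto a uniform time grid.

    timestamps -- ascending frame times in seconds (at least two)
    positions  -- one (x, y, z) tuple per timestamp
    rate_hz    -- desired constant output rate
    """
    step = 1.0 / rate_hz
    duration = timestamps[-1] - timestamps[0]
    # Small epsilon guards against losing the last sample to rounding.
    n_out = int(duration * rate_hz + 1e-9) + 1
    out = []
    i = 0
    for k in range(n_out):
        t = timestamps[0] + k * step
        # Advance to the input segment that contains t.
        while i < len(timestamps) - 2 and timestamps[i + 1] < t:
            i += 1
        t0, t1 = timestamps[i], timestamps[i + 1]
        w = (t - t0) / (t1 - t0)
        p0, p1 = positions[i], positions[i + 1]
        out.append(tuple(a + w * (b - a) for a, b in zip(p0, p1)))
    return out


# A marker moving at 10 units/s along x, sampled at uneven times.
track = resample_constant_rate(
    [0.0, 0.3, 1.0],
    [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (10.0, 0.0, 0.0)],
    2.0,
)
```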
Audio processing
- Start-end detection (blind source localization, filtering)
- Speech detection
- Creation of lo-res copy
- Orthographic transcription
  - Words
  - Events
  - Speech detection errors
- Validation (automatic and manual)
- Forced alignment
- Validation, lexicon correction and realignment
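As a rough illustration of the speech detection step, the sketch below marks stretches of audio whose short-time energy exceeds a threshold. This is a generic energy-based detector swapped in for illustration; the slides do not describe the detector actually used, and the frame length, threshold and minimum duration are arbitrary assumptions.

```python
def speech_segments(samples, sample_rate, frame_s=0.02, threshold=0.01, min_frames=3):
    """Return (start_s, end_s) pairs for stretches of high-energy audio.

    samples -- list of floats, nominally in the range -1.0 to 1.0
    """
    frame_len = int(round(sample_rate * frame_s))
    # Flag each non-overlapping frame whose mean energy exceeds the threshold.
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        flags.append(energy > threshold)

    segments = []
    run_start = None
    # The trailing False closes a run that reaches the end of the signal.
    for i, active in enumerate(flags + [False]):
        if active and run_start is None:
            run_start = i
        elif not active and run_start is not None:
            # Discard runs shorter than min_frames as likely noise.
            if i - run_start >= min_frames:
                segments.append((run_start * frame_s, i * frame_s))
            run_start = None
    return segments
```

Output of this kind would then be checked during transcription, in line with the speech detection errors noted on the slide.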
Result
- Speech/non-speech
- Breath, coughs, laughter, etc.
- Places of interest
- Orthographic transcription
- Pronunciation lexicon of all words
- Phoneme strings with times
- Gesture tracks
- Video
- Guidelines through which data can be recreated
Thank you for your attention
CLARIN has received funding from the European Community's Seventh Framework Programme under grant agreement no. 212230