Experiments with Fisher Data


Gunnar Evermann, Bin Jia, Kai Yu, David Mrva, Ricky Chan, Mark Gales, Phil Woodland
May 16th 2004
EARS STT Meeting, May 2004, Montreal

Overview
- Introduction
- Pre-processing 2000h of Fisher data
- Fisher dev04 test set
- Language Modelling
- Acoustic model training on Fisher
- Modelling techniques (MMI prior for MPE, MPE-MAP, Gaussianisation)
- Conclusions

Fisher Data Processing
- Original transcriptions: 1940h of data (1758h BBN data, 182h LDC data)
- Normalise the text, join segments, pad with silence as necessary
- Apply replacement rules for abbreviations, typos, non-speech markers, etc.
  e.g. CD -> C. D., PRIVELAGE -> PRIVILEGE, [STATIC]
  about 11k replacement rules were produced
- Produce pronunciations for 6800 unknown words (4100 whole words and 2700 partial words) with frequency greater than 2
  8500 unknown words remain; remove 14h worth of segments
- Align the segments and normalise silence boundaries
  <30h of segments failed to align
- 1819h of data remained (gender imbalance: 1042h female, 777h male)
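The replacement-rule step above can be pictured as a simple pattern-substitution pass over each transcript. The three rules below are illustrative stand-ins for the roughly 11k real rules, and deleting [STATIC] outright is an assumption, not something the slide states:

```python
import re

# Illustrative replacement rules (hypothetical; the real system had ~11k).
RULES = [
    (r"\bCD\b", "C. D."),             # expand abbreviations
    (r"\bPRIVELAGE\b", "PRIVILEGE"),  # fix common transcription typos
    (r"\[STATIC\]", ""),              # strip non-speech markers (assumed behaviour)
]

def normalise(transcript: str) -> str:
    """Apply each replacement rule in turn, then tidy whitespace."""
    for pattern, replacement in RULES:
        transcript = re.sub(pattern, replacement, transcript)
    return " ".join(transcript.split())
```

In practice each rule fires before alignment, so the acoustic models never see the raw forms.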

Training and Test Sets
Acoustic training data:
- h5train03b: 360h data set
  290h LDC data (Swb1, CHE, Swb Cellular) with MSU/LDC careful transcriptions
  70h BBN data (Cellular, Swb2-2) with quick transcriptions
- fisher3896: 520h Fisher data set, 3896 conversations with Algorithm 1 quick transcriptions
  (results presented in St. Thomas)
- fsh2004: 1820h Fisher data set
- fsh2004sub: 400h Fisher subset (balanced for gender and line condition)
- fsh2004sub2: 800h Fisher subset (gender balanced)
Test sets:
- eval03: 6h set from Fisher and Swb2-5 data, 72 conversations
- dev04: 3h set from Fisher, 36 conversations

CTS dev04 test set
Ran the CU-HTK 2003 10xRT system on dev04 to test robustness on Fisher:

  pass          eval03fi  dev04
  P1              29.7     29.8
  P2 latgen       20.0     21.0
  P3 (SAT)        18.8     19.3
  P3 (SPron)      18.9     19.5
  final           18.4     18.9
  final (STM)              18.6

%WER on two Fisher test sets (3h each) with the 2003 10xRT system
LM perplexity with the RT03 fourgram: eval03fi 65.7, dev04 61.9
- Overall dev04 is slightly harder than eval03fi and the progress set (18.2%)
- Would like to know gender and line types for the dev04 sides

How (not) to Optimise LM Interpolation Weights
- Train a separate n-gram on each corpus (Swb1, Cell1, Fisher, BN, Google, etc.)
- Optimise interpolation weights on a dev set (reference STM)
- Merge component n-grams into a single LM
- Problem: the reference STM had all contractions expanded ("don't" -> "do not")

  corpus    size   weight (STM)  weight (non-exp)
  BN        427M      0.137          0.120
  google     63M      0.071          0.063
  cell1     0.2M      0.230          0.021
  che/sw1     3M      0.022          0.042
  swb2      0.9M      0.006          0.053
  fisher     21M      0.534          0.700

weights optimised on dev04
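The weight-optimisation step is conventionally done by EM on the per-token probabilities each component LM assigns to the dev set. A minimal stdlib-only sketch of that idea; the streams-of-probabilities interface is a simplification (a real system would query the actual n-gram models):

```python
# EM for linear-interpolation weights: maximise dev-set likelihood
# (equivalently, minimise perplexity) of the mixture of component LMs.
def tune_weights(streams, iters=50):
    """streams[k][t] = P_k(w_t | history) from component LM k on dev token t."""
    K = len(streams)
    T = len(streams[0])
    w = [1.0 / K] * K  # start from uniform weights
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each token
        counts = [0.0] * K
        for t in range(T):
            mix = sum(w[k] * streams[k][t] for k in range(K))
            for k in range(K):
                counts[k] += w[k] * streams[k][t] / mix
        # M-step: re-normalise the responsibilities into new weights
        w = [c / T for c in counts]
    return w
```

The slide's point is that this procedure is only as good as the dev text it sees: run it on transcripts with expanded contractions and the weights shift toward whichever corpus happens to share that convention.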

Fisher Language Models: Perplexities
- Train a separate word 4-gram on all Fisher data (21M words)
- Interpolate with the RT-03 component n-grams

  Language Model  optimised on          Perplexity
  fgint03         dev01+eval01/03 exp      62.0
  fgint04         dev04 exp                53.6
  fgint04         dev04, no exp.           52.8

Perplexities of word 4-grams on dev04 with unexpanded contractions
- fgint03: word fourgram used in the 2003 CU-HTK system (5 components)
- fgint04: the above components plus the fsh2004 4-gram component
- size of fgint04: 6.3M bigrams, 11.6M trigrams, 4.8M 4-grams

Fisher Language Models: WER
Tested the new LM by rescoring the 2003 CU-HTK full-system lattices:

  LM       optimised on           WER   Swb   Fsh
  fgint03  dev01+eval01/02 exp    23.5  27.4  19.3
  fgint04  dev04 exp              22.6  26.7  18.3
  fgint04  dev04 noexp            22.6  26.8  18.1
  fgint04  dev04+eval03 noexp     22.6  26.8  18.1

%WER on eval03, rescoring 2003 CU-HTK system lattices
On the Fisher portion of the test set (fgintcat03, adapted HLDA MPE models):
- Using Fisher data for language modelling gives 1.2% abs.
- Optimising the interpolation weights incorrectly cost 0.2% abs.

Fisher acoustic modelling
Overall strategy:
- Pre-process all data (align, VTLN, etc.)
- Fix various software & infrastructure issues with large data sets
  (numerical accuracy, avoiding directories with 20k files, etc.)
- Select a manageable subset as the baseline for investigating new techniques
  (400h, balanced for gender, line conditions, topics)
- Concurrently investigate training on larger amounts of data:
  MLE & MPE models for the 800h Fisher set
  MLE models for all of fsh2004 + h5train03b (2200h total)

Subset selection
A 400h subset was selected from the whole Fisher data set:
- only whole conversations used
- only use sides for which all labels (gender, line, topic) were available
- ignore sides that were too short or had a high percentage of data failing to align
- balance gender
- select 25% cellular data (like the current and progress sets)
- aim for an even topic distribution
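The constraints above can be sketched as a greedy selection pass. This is a hypothetical illustration of the idea, not the actual CU-HTK selection script; the field names and the alternating-gender heuristic are assumptions:

```python
import random

def select_subset(sides, target_hours=400.0, cell_frac=0.25):
    """Greedily pick fully-labelled conversation sides, alternating genders
    and capping cellular data at roughly cell_frac of the target hours.
    sides: list of dicts with 'hours', 'gender', 'line' keys."""
    usable = [s for s in sides if s.get("gender") and s.get("line")]
    random.shuffle(usable)  # avoid any ordering bias in the source list
    chosen, hours, cell_hours = [], 0.0, 0.0
    want = "f"
    for s in usable:
        if hours >= target_hours:
            break
        if s["gender"] != want:
            continue  # alternate genders to keep the subset balanced
        if s["line"] == "cell" and cell_hours + s["hours"] > cell_frac * target_hours:
            continue  # cap cellular data near 25%
        chosen.append(s)
        hours += s["hours"]
        if s["line"] == "cell":
            cell_hours += s["hours"]
        want = "m" if want == "f" else "f"
    return chosen
```

Topic balancing would need an extra per-topic quota in the same loop; it is omitted here to keep the sketch short.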

MLE/MPE on 400h Fisher
- Train models on the new 400h Fisher subset
- Number of parameters same as before (about 6000 states, 28 components)

                            eval03  eval03sw  eval03fi  dev04
  ML  h5train03b (360h)      31.7     36.1      27.1     28.1
  ML  fisher3896 (520h)      30.8     34.7      26.6     26.9
  ML  fsh2004sub (400h)      30.8     34.6      26.7     26.8
  MPE h5train03b (360h)      27.3     31.6      22.7     23.7
  MPE fisher3896 (520h)      26.2     30.0      22.2     22.3
  MPE fsh2004sub (400h)      26.3     29.9      22.5     22.3

%WER on eval03 and dev04, unadapted, 2003 trigram
- The new Fisher 400h set gives very similar performance to the old 520h one
- WER reduction of 1% abs. over the 2003 training set

MPE with dynamic MMI prior
- Use dynamic MMI estimates instead of ML estimates as the I-smoothing prior
- 4 sets of statistics to accumulate: num, den, ml, mmi-den
  extra 1/3 memory and disk space, no extra computation

  Prior        MPE-τI  MMI-τI  eval03  eval03sw  eval03fi
  Dynamic ML     50      -      26.3     29.9     22.5
  Dynamic MMI    75      0      25.9     29.6     21.9

%WER on eval03 for MPE models trained on fsh2004sub, unadapted, 2003 trigram
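For reference, the I-smoothing update can be written in the standard extended Baum-Welch form from the MPE literature; the notation below is assumed, not taken from the slide:

```latex
\hat{\mu}_{jm} =
  \frac{\theta^{\mathrm{num}}_{jm}(\mathbf{O}) - \theta^{\mathrm{den}}_{jm}(\mathbf{O})
        + D_{jm}\,\mu_{jm} + \tau^{I}\,\mu^{\mathrm{prior}}_{jm}}
       {\gamma^{\mathrm{num}}_{jm} - \gamma^{\mathrm{den}}_{jm} + D_{jm} + \tau^{I}}
```

Here the θ terms are first-order occupancy-weighted sums of the observations, the γ terms are occupancies, D_{jm} is the per-Gaussian EBW smoothing constant, and τ^I is the I-smoothing constant. With the usual dynamic ML prior, μ^prior comes from the ML statistics; with the dynamic MMI prior of this slide, it is re-estimated each iteration from the num and mmi-den statistics instead, which is why only one extra accumulator set is needed.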

Larger data sets
Compare the 400h subset with larger training sets:

                                     eval03  eval03sw  eval03fi  dev04
  ML  h5train03b (360h)               31.7     36.1      27.1     28.1
  ML  fsh2004sub (400h)               30.8     34.6      26.7     26.8
  ML  fsh2004sub2 (800h)              30.5     34.4      26.4     26.5
  ML  fsh2004+h5train03b (2200h)      30.2     34.1      26.0     26.4
  MPE h5train03b (360h)               27.3     31.6      22.7     23.7
  MPE fsh2004sub (400h)               25.9     29.6      21.9     21.9
  MPE fsh2004sub2 (800h)              25.1     28.9      21.1     21.3

%WER on eval03 and dev04, unadapted, 2003 trigram; the Fisher models used the MMI prior
- Adding 1800h of Fisher to acoustic training improves the ML models by 1.5% abs.
- 2.2% abs. WER reduction from using 800h of Fisher instead of 360h h5train03

Putting it all together: CU-HTK P1-P2 System (5xRT)

  acoustic training set   LM           eval03  eval03sw  eval03fi
  h5train03b (360h)       LM03          24.6     28.7      20.2
  h5train03b (360h)       LM03 + fsh    23.3     27.6      18.6
  fsh2004sub (400h)       LM03 + fsh    22.7     26.7      18.4
  fsh2004sub2 (800h)      LM03 + fsh    22.0     25.9      17.8

%WER on eval03, MPE models, word 4-gram, simple adaptation
- h5train03b: Fisher data in the LM gives a 1.3% abs. improvement (1.6% on Fisher)
- fsh2004sub (400h) performs 0.6% better than h5train03b (360h)
- doubling the amount of Fisher data gives an additional 0.7%
- Total WER reduction of 2.6% abs. (2.4% on Fisher) from using 800h of Fisher data instead of 360h of Swb/CHE data for acoustics and LM

MPE Training for Gender-dependent Models
- GD MPE training of means and mixture weights on top of GI MPE training
- Static MPE-GI model parameters used as the I-smoothing prior

Unadapted single-pass decode:

  System   MPE Prior      eval03  Male  Female
  MPE-GI   Dynamic MMI     25.9   27.3   24.5
  MPE-GD   MPE-GI model    25.6   27.1   24.1

%WER on eval03, fsh2004sub models, unadapted, 2003 trigram

Test with adaptation in the P1-P2 system:

  System   MPE Prior      eval03  Male  Female
  MPE-GI   Dynamic MMI     22.7   24.0   21.4
  MPE-GD   MPE-GI model    22.4   23.8   21.0

%WER on eval03, fsh2004sub models, adapted, LM03+Fsh 4-gram, P1-P2 system

Gaussianisation
- Transform any distribution to a standard Gaussian N(0, I)

  [Figure: histogram-equalisation mapping, shown as Source PDF, Source CDF, Target CDF, Target PDF]

- Use multiple-stream (one per dimension) GMMs per speaker after HLDA:
  simplified version of the iterative Chen and Gopinath scheme;
  more compact, smoother representation than using the data directly (IBM style);
  simple to implement in HTK
- May be viewed as a higher-moment version of CMN and CVN
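The mapping in the figure is histogram equalisation: push each value through a model of its own CDF, then through the inverse standard-normal CDF. A stdlib-only sketch using an empirical CDF in place of the per-speaker GMMs, so it illustrates the transform itself rather than the CU-HTK implementation:

```python
import math

def _probit(p):
    """Inverse standard-normal CDF via bisection on math.erf (stdlib only)."""
    lo, hi = -10.0, 10.0
    for _ in range(80):
        mid = (lo + hi) / 2.0
        if 0.5 * (1.0 + math.erf(mid / math.sqrt(2.0))) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def gaussianise(values):
    """Map each value through the empirical CDF and then the inverse Gaussian
    CDF, so the output is approximately N(0, 1) distributed."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(values)
    out = [0.0] * n
    for rank, i in enumerate(order):
        p = (rank + 0.5) / n  # mid-rank avoids p = 0 or 1 at the extremes
        out[i] = _probit(p)
    return out
```

Replacing the empirical CDF with a small per-dimension GMM, as on the slide, gives a smoother, more compact estimate that generalises better from limited per-speaker data.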

Gaussianisation: MPE Results
fsh2004sub (400h) training set, 28 components + varmix; unadapted decode with 2003 trigram

  System        Swb   Fsh   Tot
  Baseline      29.7  21.9  26.0
   +CN          28.8  21.3  25.2
  Gaussianised  29.8  21.9  26.0
   +CN          28.7  21.3  25.1
  CNC           28.1  20.8  24.6

- No gain over the baseline with fsh2004sub, which is disappointing; with h5train03b there was a 0.4% absolute gain on eval03
- Possibly useful for system combination (but adapted numbers are needed)

Conclusions
- Fisher data for LM training reduces WER by 1.3% abs.
- Overall 2.6% WER reduction in the P1-P2 system from using 800h of Fisher for acoustic training and all Fisher data for the LM
- Using all Fisher and h5train03b together in MPE should improve WER further (0.3% in ML)
- Need to investigate the number of model parameters for large training sets