Towards Using Hybrid Word and Fragment Units for Vocabulary Independent LVCSR Systems

Ariya Rastrow, Abhinav Sethy, Bhuvana Ramabhadran and Fred Jelinek
Center for Language and Speech Processing; IBM TJ Watson Research Lab
September 9, 2009

Outline
1 Introduction
  - Why Sub-Word Units?
  - Hybrid Systems
  - Experimental Setup
2 Hybrid Systems for OOV Detection
3 Improving Phone Accuracy and Robustness
4 From Sub-word units to Words
5 Summary

Introduction: Why Sub-Word Units?
The simplest answer: to recognize OOV (out-of-vocabulary) terms in ASR.
- All LVCSR-based systems have a closed word vocabulary.
- The recognizer replaces OOV terms with the closest match in the vocabulary.
- Neighboring words are also often misrecognized, contributing to recognition errors.
- OOVs degrade the performance of later processing stages (e.g., translation, understanding, document retrieval, term detection).
- Although the OOV rate may be relatively low in state-of-the-art ASR systems, rare and unexpected events are information rich.
- The eventual goal is to build an open-vocabulary speech recognizer.

Introduction: Why Sub-Word Units? (continued)
- Fragments are sub-word units (variable-length phone sequences) selected automatically using statistical, data-driven methods (see the slides that follow).
- Fragments have the potential to provide a good trade-off between coverage and accuracy.

Introduction: Hybrid Systems
A hybrid system:
- Represents language as a combination of words and fragments.
- Takes advantage of both word and fragment representations, yielding improved performance while providing good coverage.
- Uses an LM built for this combined representation.

Introduction: Hybrid Systems (Hybrid Language Model in Detail)
Step 1: Fragment selection based on N-gram pruning
- Convert the LM training text (excluding OOVs) to phones, build an N-gram phone LM (a 5-gram in our case), and prune it with entropy-based pruning.
- The N-grams that survive pruning form the set of fragments (from single phones up to 5-phone sequences).
- Fragments: IH N K L AA R K
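The selection step can be illustrated with a minimal sketch. It is not the authors' procedure: ranking phone N-grams by raw frequency stands in here for entropy-based pruning of a 5-gram phone LM, and the names `select_fragments` and `num_fragments` are hypothetical.

```python
from collections import Counter


def select_fragments(phone_sentences, max_order=5, num_fragments=10):
    """Return a fragment inventory as a set of phone tuples.

    ASSUMPTION: frequency ranking is a simplified stand-in for the
    entropy-based pruning named on the slide; it is not the same criterion.
    """
    counts = Counter()
    for phones in phone_sentences:
        # Count every phone n-gram of order 1..max_order in the sentence.
        for n in range(1, max_order + 1):
            for i in range(len(phones) - n + 1):
                counts[tuple(phones[i:i + n])] += 1

    # Always keep single phones so that any phone string can be covered.
    singles = {f for f in counts if len(f) == 1}
    # Keep the most frequent longer n-grams; ties are broken alphabetically.
    longer = sorted((f for f in counts if len(f) > 1),
                    key=lambda f: (-counts[f], f))
    return singles | set(longer[:num_fragments])


if __name__ == "__main__":
    toy = [["IH", "N", "K"], ["IH", "N"], ["K", "L", "AA", "R", "K"]]
    print(sorted(select_fragments(toy, num_fragments=3)))
```

Keeping all single phones mirrors the slide's statement that fragments range from single phones to 5-phone sequences, and guarantees that any OOV pronunciation can be segmented.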

Introduction: Hybrid Systems
Step 2: Converting word-based training data into hybrid word/fragment data
Word-based sentence:
  <s> THE BODY OF ZIYAD HAMDI WHO HAD BEEN SHOT WAS FOUND SOUTH OF THE CITY </s>
- Pronunciations for the OOV terms are needed; they are obtained from grapheme-to-phone models:
  ZIYAD: Z IY AE D
  HAMDI: HH AE M D IY
Hybrid sentence:
  <s> THE BODY OF Z IY Y AE D HH AE M D IY WHO HAD BEEN SHOT WAS FOUND SOUTH OF THE CITY </s>
- The fragment representation of each OOV is obtained by a left-to-right greedy search.
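A sketch of the conversion, under stated assumptions: the slide says only "left-to-right greedy search", so taking the longest matching fragment at each position is an assumption, as are the toy fragment inventory, the dictionary standing in for the grapheme-to-phone model, and the underscore notation for fragment tokens.

```python
def greedy_fragment(phones, fragments, max_len=5):
    """Segment a phone sequence left to right, longest fragment first.

    ASSUMPTION: longest-match is one plausible reading of "greedy".
    A single phone is always accepted as a fallback.
    """
    out = []
    i = 0
    while i < len(phones):
        for n in range(min(max_len, len(phones) - i), 0, -1):
            cand = tuple(phones[i:i + n])
            if cand in fragments or n == 1:
                out.append("_".join(cand))  # underscore joining is illustrative
                i += n
                break
    return out


def to_hybrid(words, lexicon, g2p, fragments):
    """Replace every out-of-lexicon word by its fragment sequence."""
    hybrid = []
    for w in words:
        if w in lexicon:
            hybrid.append(w)
        else:
            hybrid.extend(greedy_fragment(g2p[w], fragments))
    return hybrid


if __name__ == "__main__":
    fragments = {("HH", "AE", "M"), ("D", "IY")}    # hypothetical inventory
    g2p = {"HAMDI": ["HH", "AE", "M", "D", "IY"]}   # stand-in for a G2P model
    print(to_hybrid(["OF", "HAMDI", "WHO"], {"OF", "WHO"}, g2p, fragments))
```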

Introduction: Hybrid Systems (Hybrid Language Model in Detail)
Step 3: Build the LM on the hybrid word/fragment set
- Treat fragments as individual terms.
- After this step, the hybrid LM is built: a single LM that includes both words and fragments.

Introduction: Experimental Setup
- The LVCSR system is based on the 2007 IBM speech transcription system for the GALE Distillation Go/No-go Evaluation.
- Acoustic models are discriminatively trained on speaker-adapted PLP features (the best broadcast news acoustic models from IBM). The same acoustic models are used for all systems in our experiments.
- The LM training text (for all systems) consists of 335M words from 8 sources of BN corpora. Both word and hybrid LMs are 4-gram LMs with Kneser-Ney smoothing.
- Word lexicons ranging from 10K to 84K words were selected by sorting the words by their frequency in the acoustic training data (broadcast news, Hub4).

Introduction: Experimental Setup (continued)
- The set of fragments (sub-word units) is selected as described (5-gram phone LM) on the LM training text for each vocabulary size.
- The size of this set was fixed at roughly 20K for all systems. The hybrid system therefore includes 20K fragments in addition to the words in its lexicon.
- We report results on:
  - the RT-04 BN evaluation set (45K words, 4.5 hours) as an in-domain test set;
  - the MIT lectures data set (176K words, 21 hours, 20 lectures) as an out-of-domain test set.

Introduction: Experimental Setup
OOV rates for different lexicon sizes
Lexicon size    10k    20k    30k    40k    60k    84k
RT-04 (%)      5.04   2.48   1.47   1.04   0.68   0.54
Lectures (%)   7.88   5.45   4.51   4.09   3.53   3.45
Table: OOV rates for the RT-04 set and the MIT lectures data
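For reference, an OOV rate of this kind can be computed as below. The sketch assumes the rate is measured over running word tokens of the test transcript (not over distinct word types), which the slides do not state explicitly; `oov_rate` is a hypothetical helper.

```python
def oov_rate(tokens, lexicon):
    """Percentage of running tokens that are not in the lexicon.

    ASSUMPTION: token-level (running-word) OOV rate.
    """
    if not tokens:
        return 0.0
    n_oov = sum(1 for t in tokens if t not in lexicon)
    return 100.0 * n_oov / len(tokens)


if __name__ == "__main__":
    lexicon = {"THE", "BODY", "OF", "WHO"}
    tokens = "THE BODY OF ZIYAD HAMDI WHO".split()
    print(f"{oov_rate(tokens, lexicon):.2f}%")
```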

Outline: 2 Hybrid Systems for OOV Detection
  - Fragment Posteriors Using Consensus
  - Evaluation
  - Results

Hybrid Systems for OOV Detection
- Because fragments were used in place of OOVs when building the LM, the appearance of fragments in the ASR output indicates an OOV region.
- The simplest approach is to search for fragments in the decoder's 1-best output.
- A better approach is to search for fragments in the lattice.
- Fragments allow us both to detect OOVs and to represent them.
ASR: TODAY TWO YOUNG GIANT PANDAS FROM CHINA ARRIVED ON A SPECIALLY R EH T R OW F IH T IH D FEDEX JET
REF: TODAY TWO YOUNG GIANT PANDAS FROM CHINA ARRIVED ON A SPECIALLY RETROFITTED FEDEX JET

Hybrid Systems for OOV Detection: Fragment Posteriors Using Consensus
- Lattices are hard to work with, especially when their timings are needed.
- It is easier to use a compact form of the lattice: confusion networks.
- With posterior probabilities for each hypothesis, we can observe the appearance of fragments and their likelihood.
- To identify OOV regions in the confusion network, we compute an OOV score:
  \[ \text{OOV score}(t_j) = \sum_{f \in t_j} p(f \mid t_j) \]
  where t_j is a given bin of the confusion network and the f are the fragments in that bin.
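The OOV score of a bin is the total posterior mass of the fragments in that bin. A minimal sketch, assuming a bin is represented as a dictionary from hypothesis token to posterior and that the fragment inventory is available as a set (both representations, and the underscore token names, are illustrative):

```python
def oov_score(bin_posteriors, fragments):
    """Sum the posteriors of all fragment hypotheses in one confusion-network bin."""
    return sum(p for tok, p in bin_posteriors.items() if tok in fragments)


if __name__ == "__main__":
    fragments = {"F_IH_T", "IH_D"}                            # hypothetical fragment tokens
    bin_t = {"FEDEX": 0.5, "F_IH_T": 0.3, "IH_D": 0.2}        # posteriors sum to 1
    print(oov_score(bin_t, fragments))
```

A bin that contains no fragment hypotheses gets a score of 0, so thresholding the score directly yields an OOV detector.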

Hybrid Systems for OOV Detection: Evaluation
Evaluating OOV detection
- The ASR transcript (output) is compared to the reference transcript at the frame level, using forced alignment.
- Each frame is assigned a score equal to the OOV score of the region it belongs to (see the previous slide).
- Each frame is tagged as belonging to an OOV or an in-vocabulary (IV) region.
- False alarm and miss probabilities on the set are shown as standard detection error trade-off (DET) curves.
- For word systems, the entropy of the bins in the confusion network is used as the OOV score.
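The frame-level evaluation can be sketched as follows. One point on a DET curve is a (miss, false alarm) pair at a given threshold; sweeping the threshold traces the curve. The function names, the "score >= threshold means OOV" convention, and the use of the natural logarithm for the entropy are assumptions made for illustration.

```python
import math


def bin_entropy(posteriors):
    """Entropy (in nats) of one confusion-network bin, the word-system OOV score."""
    return -sum(p * math.log(p) for p in posteriors if p > 0)


def miss_and_false_alarm(scores, is_oov, threshold):
    """Frame-level miss and false-alarm probabilities at one threshold.

    A frame is flagged as OOV when its score is >= threshold (assumed convention).
    """
    n_oov = sum(1 for o in is_oov if o)
    n_iv = len(is_oov) - n_oov
    misses = sum(1 for s, o in zip(scores, is_oov) if o and s < threshold)
    false_alarms = sum(1 for s, o in zip(scores, is_oov) if not o and s >= threshold)
    p_miss = misses / n_oov if n_oov else 0.0
    p_fa = false_alarms / n_iv if n_iv else 0.0
    return p_miss, p_fa


if __name__ == "__main__":
    scores = [0.9, 0.2, 0.6, 0.1]            # per-frame OOV scores (toy values)
    is_oov = [True, True, False, False]      # per-frame reference tags
    for thr in (0.1, 0.5, 0.9):
        print(thr, miss_and_false_alarm(scores, is_oov, thr))
```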

Hybrid Systems for OOV Detection: Results
Figure: DET curves using hybrid and word system features (miss probability in % versus false alarm probability in %) for the WRD-10k, HYB-10k, WRD-84k and HYB-84k systems

Outline: 3 Improving Phone Accuracy and Robustness
  - Phone Error Rate
  - Results

Improving Phone Accuracy and Robustness
- Many applications in HLT need an accurate automatic phone recognizer, e.g., spoken term detection (STD).
- In the STD task, OOV terms (queries) cannot be detected and retrieved. The new techniques that have been proposed are all essentially based on phonetic search for OOV queries.
- It is well known that LVCSR-based systems have better phone accuracy than phone recognizers with a phone LM.
- Question: Is adding new words (enlarging the dictionary) the only way to improve phone accuracy?
- Sub-word units are not specific to a given domain or genre and reveal the phonetic structure of the language, so applying them to out-of-domain data is expected to substantially improve phone accuracy.

Improving Phone Accuracy and Robustness: Phone Error Rate
- Phone error rate (PER) is computed using the NIST scoring tool.
- The phone sequence of the 1-best output is aligned with the reference phone sequence.
- The reference phone sequence is obtained by forced alignment to the reference transcript.
- Pronunciations of OOVs in the reference are obtained using a letter-to-sound system.
- The oracle phone error rate is also computed on the phonetic lattices. For this, the hybrid (word/fragment) lattices are converted to phonetic lattices.
- To measure the contribution of the OOV regions to PER, the ratio \( \mathrm{PER}_{\mathrm{oov}} / \mathrm{PER} \) is computed and shown.
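The slides use the NIST scoring tool for PER. As an illustration of the underlying quantity only, the sketch below computes PER as the Levenshtein distance (substitutions, deletions, insertions with unit cost) between hypothesis and reference phone sequences, divided by the reference length. It is a stand-in, not a reimplementation of the NIST tool, and the hypothesis phones in the demo are invented.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, unit cost per edit."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                       # deletion
                           cur[j - 1] + 1,                    # insertion
                           prev[j - 1] + (0 if r == h else 1)))  # match or substitution
        prev = cur
    return prev[-1]


def phone_error_rate(ref, hyp):
    """PER in percent: edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference phone sequence must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    ref = "R EH T R OW F IH T IH D".split()
    hyp = "R EH T R OW F IH T AH D".split()   # one invented substitution
    print(phone_error_rate(ref, hyp))
```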

Improving Phone Accuracy and Robustness: Results
Figure: PER results, Phone Error Rate (PER) % versus lexicon size (10k to 84k), for the hybrid and word systems: (left) RT-04, (right) MIT Lectures
Figure: PER in OOV regions as a percentage of the overall PER, (PER oov / PER) % versus lexicon size (10k to 84k), for the hybrid and word systems: (left) RT-04, (right) MIT Lectures

Improving Phone Accuracy and Robustness: Results (continued)
Figure: Oracle PER (%) of the word and hybrid systems versus lexicon size (10k to 84k), with RT-04 shown on the left Y-axis and the MIT data set shown on the right Y-axis

Outline: 4 From Sub-word units to Words
  - Results

From Sub-word units to Words
We cannot expect the customer to be satisfied with the hybrid output:
  FROM THE C. N. N. GLOBAL HEADQUARTERS IN ATLANTA I M CAROL K AA S T EH L OW (COSTELLO). THANKS YOU FOR WAKING UP WITH US
even though the hybrid output is much better and more understandable than:
  FROM THE C. N. N. GLOBAL HEADQUARTERS IN ATLANTA I M CAROL COX FELLOW (COSTELLO). THANKS YOU FOR WAKING UP WITH US
Figure: back-transduction of the hybrid output into words (diagram not recoverable from the source; surviving labels: L, d, D inv, W)
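The slide's own procedure for turning fragments back into words is not recoverable from the text, so the sketch below is not the authors' method. It only illustrates the idea of mapping a run of fragments to a word using a larger pronunciation lexicon: each maximal run of fragments is replaced by the lexicon word whose pronunciation is closest in phone edit distance. The token representation (words as strings, fragments as phone tuples), the function names, and the toy lexicon are all assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phone sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (0 if x == y else 1)))
        prev = cur
    return prev[-1]


def back_transduce(hybrid_tokens, pron_lexicon):
    """Replace each maximal run of fragments by the nearest lexicon word.

    ASSUMPTIONS: words are str, fragments are tuples of phones, and every
    fragment run corresponds to exactly one word (a strong simplification).
    """
    words, run = [], []

    def flush():
        if run:
            # sorted() makes tie-breaking deterministic (alphabetical).
            best = min(sorted(pron_lexicon),
                       key=lambda w: edit_distance(run, pron_lexicon[w]))
            words.append(best)
            run.clear()

    for tok in hybrid_tokens:
        if isinstance(tok, tuple):
            run.extend(tok)
        else:
            flush()
            words.append(tok)
    flush()
    return words


if __name__ == "__main__":
    lexicon = {  # toy pronunciation lexicon, for illustration only
        "COSTELLO": ["K", "AA", "S", "T", "EH", "L", "OW"],
        "COX": ["K", "AA", "K", "S"],
    }
    hyp = ["CAROL", ("K", "AA", "S"), ("T", "EH", "L", "OW"), "THANKS"]
    print(back_transduce(hyp, lexicon))
```

This nearest-pronunciation lookup ignores language model context entirely, whereas the slides state that both lexicon and LM information are used.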

From Sub-word units to Words: Results
In our experiments, the 84k lexicon and LM information are used as meta-information.
Vocab. size    10k    20k    30k    40k    60k    84k
Hybrid (%)    15.5   14.9   14.6   14.4   14.2   14.1
Word (%)      17.1   16.0   15.1   14.6   14.3   14.1
Table: WER on the RT-04 Eval set after the back-transduction shown on the previous slide
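The size of the gain at each lexicon size can be read off the table with a little arithmetic. The sketch below uses only the WER values from the table; the helper name `wer_gains` is hypothetical.

```python
VOCAB = ["10k", "20k", "30k", "40k", "60k", "84k"]
HYBRID_WER = [15.5, 14.9, 14.6, 14.4, 14.2, 14.1]
WORD_WER = [17.1, 16.0, 15.1, 14.6, 14.3, 14.1]


def wer_gains(word, hybrid):
    """Return (absolute, relative %) WER reduction of hybrid over word, per size."""
    return [(round(w - h, 1), round(100.0 * (w - h) / w, 1))
            for w, h in zip(word, hybrid)]


if __name__ == "__main__":
    for size, (abs_gain, rel_gain) in zip(VOCAB, wer_gains(WORD_WER, HYBRID_WER)):
        print(f"{size}: {abs_gain} absolute, {rel_gain}% relative")
```

The gain shrinks as the lexicon grows and vanishes at 84k, which is the lexicon used as meta-information for the back-transduction.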


Summary
We showed:
- A basic method for fragment selection and for building a hybrid system.
- The appearance of fragments in the output is a good indicator of OOV regions (an improvement over the entropy of bins from the word system).
- Using fragments along with words improves phone accuracy and can be helpful for the STD task (for any lexicon size).
- A hybrid system trained on a generic domain (where sufficient training data is available) can be used on low-resource domains.
- Hybrid system output is richer and closer to the phonetic truth than word system output.

Questions/Comments