Towards Using Hybrid Word and Fragment Units for Vocabulary Independent LVCSR Systems


Towards Using Hybrid Word and Fragment Units for Vocabulary Independent LVCSR Systems
Ariya Rastrow, Abhinav Sethy, Bhuvana Ramabhadran and Fred Jelinek
Center for Language and Speech Processing / IBM TJ Watson Research Lab
September 9, 2009

Outline
1 Introduction (Why Sub-Word Units?, Hybrid Systems, Experimental Setup)
2 Hybrid Systems for OOV Detection
3 Improving Phone Accuracy and Robustness
4 From Sub-word Units to Words
5 Summary

Introduction: Why Sub-Word Units?
The simplest answer: recognizing OOV terms in ASR.
All LVCSR-based systems have a closed word vocabulary. The recognizer replaces OOV terms with the closest match in the vocabulary, and neighboring words are also often misrecognized, contributing to recognition errors.
OOVs degrade the performance of later processing stages (e.g. translation, understanding, document retrieval, term detection).
Although the OOV rate may be relatively low in state-of-the-art ASR systems, rare and unexpected events are information rich.
The eventual goal is to build an open-vocabulary speech recognizer.

Introduction: Why Sub-Word Units?
Fragments are sub-word units (variable-length phone sequences) selected automatically using statistical, data-driven methods (see the slides that follow).
Fragments have the potential to provide a good trade-off between coverage and accuracy.

Introduction: Hybrid Systems
A hybrid system represents language as a combination of words and fragments.
It takes advantage of both word and fragment representations, yielding improved performance while providing good coverage.
The LM is built for such a hybrid representation.

Introduction: Hybrid Language Model in Detail
Step 1: Fragment selection based on N-gram pruning.
Convert the LM training text (excluding OOVs) to phones, build an N-gram (in our case 5-gram) phone LM, and prune it with entropy-based pruning.
Pruning selects the set of fragments (from single phones up to 5-phone sequences).
Example fragments: IH N, K L AA R K.
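To make Step 1 concrete, here is a minimal Python sketch of fragment selection. It is not the exact entropy-based (Stolcke-style) LM pruning used in the talk: as a crude stand-in, it ranks candidate phone n-grams (lengths up to 5) by an approximate entropy contribution, count(g) * |log p(g | history) - log p_backoff|, and keeps the top K as fragments. The corpus format and the select_fragments helper are illustrative assumptions, not the authors' code.

```python
from collections import Counter
from math import log

def select_fragments(phone_sentences, max_order=5, num_fragments=20000):
    """Crude stand-in for entropy-based pruning of a phone n-gram LM.

    phone_sentences: list of phone lists, e.g. [["DH", "AH", "B", "AA", ...], ...]
    Returns the top-scoring phone n-grams (as tuples) to use as fragments.
    """
    ngram_counts = Counter()
    for sent in phone_sentences:
        for n in range(1, max_order + 1):
            for i in range(len(sent) - n + 1):
                ngram_counts[tuple(sent[i:i + n])] += 1

    total_unigrams = sum(c for g, c in ngram_counts.items() if len(g) == 1)
    scores = {}
    for gram, count in ngram_counts.items():
        if len(gram) == 1:
            continue  # single phones are always kept as fallback units
        hist, last = gram[:-1], gram[-1:]
        p_full = count / ngram_counts[hist]                # p(last | history)
        p_backoff = ngram_counts[last] / total_unigrams    # unigram backoff estimate
        # approximate contribution of this n-gram to the model's entropy
        scores[gram] = count * abs(log(p_full) - log(p_backoff))

    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:num_fragments])
```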

Introduction: Hybrid Systems
Step 2: Convert the word-based training data into hybrid word/fragment data.
< s > THE BODY OF ZIYAD HAMDI WHO HAD BEEN SHOT WAS FOUND SOUTH OF THE CITY < /s >
We need pronunciations for the OOV terms, obtained from grapheme-to-phone models:
ZIYAD: Z IY AE D
HAMDI: HH AE M D IY
< s > THE BODY OF Z IY Y AE D HH AE M D IY WHO HAD BEEN SHOT WAS FOUND SOUTH OF THE CITY < /s >
The fragment representation of each OOV is obtained by a left-to-right greedy search.
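The left-to-right greedy search can be sketched as follows (my reading of the slide, not the authors' code): at each position, take the longest fragment in the inventory that matches the remaining phone sequence, falling back to a single phone when nothing longer matches.

```python
def greedy_fragmentize(phones, fragments, max_len=5):
    """Segment an OOV pronunciation (list of phones) into fragments,
    preferring the longest matching fragment at each position."""
    out, i = [], 0
    while i < len(phones):
        for n in range(min(max_len, len(phones) - i), 0, -1):
            piece = tuple(phones[i:i + n])
            if n == 1 or piece in fragments:
                out.append("_".join(piece))  # e.g. "HH_AE_M"
                i += n
                break
    return out

# Example with a hypothetical fragment inventory:
frags = {("Z", "IY"), ("AE", "D"), ("HH", "AE", "M"), ("D", "IY")}
print(greedy_fragmentize(["Z", "IY", "AE", "D"], frags))       # ['Z_IY', 'AE_D']
print(greedy_fragmentize(["HH", "AE", "M", "D", "IY"], frags)) # ['HH_AE_M', 'D_IY']
```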

Introduction: Hybrid Language Model in Detail
Step 3: Build the LM on the hybrid word/fragment data, treating fragments as individual terms.
After this step the hybrid LM is built: a single LM that includes both words and fragments.

Introduction: Experimental Setup
The LVCSR system is based on the 2007 IBM speech transcription system for the GALE Distillation Go/No-go Evaluation.
Acoustic models are discriminatively trained on speaker-adapted PLP features (IBM's best Broadcast News acoustic models) and are common to all systems in our experiments.
The LM training text (for all systems) consists of 335M words from 8 sources of BN corpora.
Both word and hybrid LMs are 4-gram LMs with Kneser-Ney smoothing.
Word lexicons ranging from 10K to 84K words were selected by sorting words by frequency in the acoustic training data (Broadcast News Hub4).

Introduction: Experimental Setup (continued)
The set of fragments (sub-word units) is selected as described (5-gram phone LM) on the LM training text for each vocabulary size; the size of this set was fixed at roughly 20K for all systems. The hybrid system therefore includes 20K fragments in addition to the words in its lexicon.
We report results on:
the RT-04 BN evaluation set (45K words, 4.5 hours) as an in-domain test set;
the MIT Lectures data set (176K words, 21 hours, 20 lectures) as an out-of-domain test set.

Introduction: Experimental Setup
OOV rates for different lexicon sizes:

Lexicon size   10k    20k    30k    40k    60k    84k
RT-04 (%)      5.04   2.48   1.47   1.04   0.68   0.54
Lectures (%)   7.88   5.45   4.51   4.09   3.53   3.45

Table: OOV rates for the RT-04 set and the MIT Lectures data

Outline
1 Introduction
2 Hybrid Systems for OOV Detection (Fragment Posteriors Using Consensus, Evaluation, Results)
3 Improving Phone Accuracy and Robustness
4 From Sub-word Units to Words
5 Summary

Hybrid Systems for OOV Detection
The idea: since fragments were used in place of OOVs when building the LM, the appearance of fragments in the ASR output indicates an OOV region.
The simple approach is to search for fragments in the decoder's 1-best output; a better one is to search for them in the lattice.
Fragments allow us both to detect OOVs and to represent them.
ASR: TODAY TWO YOUNG GIANT PANDAS FROM CHINA ARRIVED ON A SPECIALLY R EH T R OW F IH T IH D FEDEX JET
REF: TODAY TWO YOUNG GIANT PANDAS FROM CHINA ARRIVED ON A SPECIALLY RETROFITTED FEDEX JET

Hybrid Systems for OOV Detection: Fragment Posteriors Using Consensus
Lattices are hard to work with, especially when their timings are needed; it is easier to use their compact form, confusion networks.
With posterior probabilities for each hypothesis, we can observe the appearance of fragments and their likelihood.
To identify OOV regions in the confusion network, we compute an OOV score for each bin:

OOV score(t_j) = \sum_{f \in t_j} p(f \mid t_j)

where t_j is a given bin of the confusion network and f ranges over the fragments in that bin.
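A minimal sketch of this scoring, assuming a confusion network represented as a list of bins, each bin being a dict from hypothesis token to posterior, with fragment tokens written as underscore-joined phones (both representation choices are mine, not the authors'):

```python
def oov_score(bin_posteriors, is_fragment):
    """Sum of fragment posteriors within one confusion-network bin."""
    return sum(p for tok, p in bin_posteriors.items() if is_fragment(tok))

def score_confusion_network(bins, is_fragment):
    """Return one OOV score per bin of the confusion network."""
    return [oov_score(b, is_fragment) for b in bins]

# Toy example: the second bin is dominated by a fragment hypothesis.
is_frag = lambda tok: "_" in tok            # assumption: fragments look like "R_EH_T"
cn = [
    {"SPECIALLY": 0.9, "ESPECIALLY": 0.1},
    {"R_EH_T_R_OW": 0.7, "RED": 0.2, "RETRO": 0.1},
]
print(score_confusion_network(cn, is_frag))  # [0.0, 0.7]
```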

Hybrid Systems for OOV Detection: Evaluating OOV Detection
The ASR output is compared to the reference transcript at the frame level (via forced alignment).
Each frame is assigned a score equal to the OOV score of the region it belongs to (previous slide), and each frame is tagged as belonging to an OOV or IV region.
False-alarm and miss probabilities on the set are shown as standard detection error trade-off (DET) curves.
For the word systems, the entropy of the confusion-network bins is used as the OOV score.
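The DET curve itself is just miss and false-alarm rates swept over a score threshold; here is a small sketch under the frame-level setup above (frame scores and binary OOV labels are assumed inputs):

```python
def det_points(frame_scores, frame_is_oov, thresholds):
    """For each threshold, flag frames with score >= threshold as OOV and
    return (miss_prob, false_alarm_prob) pairs, as plotted on a DET curve."""
    points = []
    n_oov = sum(frame_is_oov)
    n_iv = len(frame_is_oov) - n_oov
    for thr in thresholds:
        misses = sum(1 for s, oov in zip(frame_scores, frame_is_oov) if oov and s < thr)
        false_alarms = sum(1 for s, oov in zip(frame_scores, frame_is_oov) if not oov and s >= thr)
        points.append((misses / n_oov, false_alarms / n_iv))
    return points

# Toy usage with five frames, two of which lie in an OOV region
scores = [0.05, 0.8, 0.6, 0.1, 0.9]
labels = [False, True, True, False, False]
print(det_points(scores, labels, thresholds=[0.3, 0.7]))
```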

Hybrid Systems for OOV Detection: Results
Figure: DET curves (miss probability vs. false-alarm probability, both in %) for the hybrid and word systems with 10k and 84k lexicons (WRD-10k, HYB-10k, WRD-84k, HYB-84k).

Outline
1 Introduction
2 Hybrid Systems for OOV Detection
3 Improving Phone Accuracy and Robustness (Phone Error Rate, Results)
4 From Sub-word Units to Words
5 Summary

Improving Phone Accuracy and Robustness
Many HLT applications need an accurate automatic phone recognizer, e.g. spoken term detection (STD).
In the STD task, OOV terms (queries) cannot be detected and retrieved; newly proposed techniques are all essentially based on phonetic search for OOV queries.
It is well known that LVCSR-based systems have better phone accuracy than phone recognizers with a phone LM.
Question: is adding new words (enlarging the dictionary) the only way to improve phone accuracy?
Sub-word units are not specific to a given domain or genre and reveal the phonetic structure of the language, so applying them to out-of-domain data is expected to substantially improve phone accuracy.

Improving Phone Accuracy and Robustness: Phone Error Rate
Phone Error Rate (PER) is computed using the NIST scoring tool: the phone sequence of the 1-best output is aligned with the reference phone sequence.
The reference phone sequence is obtained by forced alignment to the reference transcript; pronunciations of OOVs in the reference are obtained using a letter-to-sound system.
Oracle PER is also computed on the phonetic lattices; for this, the hybrid (word/fragment) lattices are converted to phonetic lattices.
To measure the contribution of the OOV regions to the PER, the ratio PER_oov / PER is computed and shown.
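PER is the usual edit-distance-based error rate over phones; a self-contained sketch (standing in for the NIST scoring tool that the talk actually uses) is:

```python
def phone_error_rate(ref, hyp):
    """Edit distance (substitutions + deletions + insertions) between reference
    and hypothesis phone sequences, divided by the reference length."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # match / substitution
    return d[len(ref)][len(hyp)] / len(ref)

ref = "R EH T R OW F IH T IH D".split()
hyp = "R EH T R OW F IH T AH D".split()
print(phone_error_rate(ref, hyp))   # 0.1 (one substitution out of ten phones)
```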

Improving Phone Accuracy and Robustness: Results
Figure: PER (%) of the hybrid and word systems as a function of lexicon size (10k-84k): (left) RT-04, (right) MIT Lectures.
Figure: PER in OOV regions as a percentage of the overall PER (PER_oov / PER, %), hybrid vs. word system, for lexicon sizes 10k-84k: (left) RT-04, (right) MIT Lectures.

Improving Phone Accuracy and Robustness: Results (continued)
Figure: Oracle PER (%) of the word and hybrid systems as a function of lexicon size (10k-84k), on RT-04 (shown on the left Y-axis) and the MIT data set (shown on the right Y-axis).

Outline
1 Introduction
2 Hybrid Systems for OOV Detection
3 Improving Phone Accuracy and Robustness
4 From Sub-word Units to Words (Results)
5 Summary

From Sub-word Units to Words
We cannot expect the customer to be satisfied with the hybrid output:
FROM THE C. N. N. GLOBAL HEADQUARTERS IN ATLANTA I M CAROL K AA S T EH L OW (COSTELLO). THANKS YOU FOR WAKING UP WITH US
even though the hybrid output is much better and more understandable than:
FROM THE C. N. N. GLOBAL HEADQUARTERS IN ATLANTA I M CAROL COX FELLOW (COSTELLO). THANKS YOU FOR WAKING UP WITH US
Figure: back-transduction of the hybrid word/fragment output into a word sequence.
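The talk recovers words from fragment regions by bringing in a large (84k) lexicon and LM as meta-information (next slide). As a very rough illustration of the idea only, and not the authors' lattice-based method, one can greedily look the fragment phone sequence back up in a phone-to-word dictionary:

```python
def phones_to_words(phones, pron_dict, max_len=12):
    """Greedy longest-match lookup of a phone sequence in a pronunciation
    dictionary {phone-tuple: word}. Unmatched phones are passed through."""
    out, i = [], 0
    while i < len(phones):
        for n in range(min(max_len, len(phones) - i), 0, -1):
            word = pron_dict.get(tuple(phones[i:i + n]))
            if word:
                out.append(word)
                i += n
                break
        else:
            out.append(phones[i])   # no word found: keep the bare phone
            i += 1
    return out

# Hypothetical dictionary entry; the real system uses the full 84k lexicon and LM scores.
pron = {("K", "AA", "S", "T", "EH", "L", "OW"): "COSTELLO"}
print(phones_to_words("K AA S T EH L OW".split(), pron))   # ['COSTELLO']
```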

From Sub-word Units to Words: Results
In our experiments, the 84k lexicon and LM are used as the meta-information.

Vocab. size   10k    20k    30k    40k    60k    84k
Hybrid (%)    15.5   14.9   14.6   14.4   14.2   14.1
Word (%)      17.1   16.0   15.1   14.6   14.3   14.1

Table: WER on the RT-04 eval set after the back-transduction of the previous slide

Outline
1 Introduction
2 Hybrid Systems for OOV Detection
3 Improving Phone Accuracy and Robustness
4 From Sub-word Units to Words
5 Summary

Summary
We showed:
a basic method for fragment selection and for building a hybrid system;
that the appearance of fragments in the output is a good indicator of OOV regions (an improvement over the entropy of bins from the word system);
that using fragments along with words improves phone accuracy, for any lexicon size, and can be helpful for the STD task;
that a hybrid system trained on a generic domain (where sufficient training data is available) can be used on domains with low resources;
that the hybrid system output is richer and closer to the phonetic truth than the word system output.

Summary
Questions / Comments