First Step Towards Enhancing Word Embeddings with Pitch Accents for DNN-based Slot Filling on Recognized Text

Sabrina Stehwien, Ngoc Thang Vu
IMS, University of Stuttgart
March 16, 2017

Slot Filling
- sequential labelling task: assign semantic labels to each word in an input sequence
- key query terms fill a semantic frame or slot, e.g. locations, time periods
- benchmark corpus: Airline Travel Information Systems (ATIS)
- state-of-the-art DNN models yield around 95% F1-score (span-level F1 is sketched below)
- typical features: word embeddings (lexico-semantic representations)
- example:

  SHOW  FLIGHTS  FROM  BURBANK              TO  MILWAUKEE          FOR  TODAY
  O     O        O     B-fromloc.city_name  O   B-toloc.city_name  O    B-depart_date.today_relative
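The slides quote slot F1 throughout; as a point of reference, here is a minimal Python sketch of span-level F1 over BIO labels (a simplified stand-in for the usual CoNLL-style scorer, not the authors' evaluation code):

```python
def bio_spans(labels):
    """Extract (slot_type, start, end) spans from a BIO label sequence."""
    spans, start = [], None
    for i, lab in enumerate(labels + ["O"]):  # "O" sentinel flushes the last span
        if start is not None and not lab.startswith("I-"):
            spans.append((labels[start][2:], start, i))
            start = None
        if lab.startswith("B-"):
            start = i
    return spans

def slot_f1(gold, pred):
    """Span-level F1: a slot counts as correct only if type and exact span match."""
    g, p = set(bio_spans(gold)), set(bio_spans(pred))
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```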

Motivation
- slot filling is a text-based task; however, spoken language understanding (SLU) involves automatic speech recognition (ASR) as a first step
- realistic setting: apply and optimize on ASR output, taking recognition errors into account
- related work shows that slot filling performance drops on recognized text
- additional information extracted from the speech signal, and not present in the text, may help
- prosodic information, e.g. pitch accents

Pitch Accents in Slot Filling
- certain words are marked as salient to highlight important information (focus, contrast, information status)
- pitch accents are useful for various NLP and SLU tasks: named entity recognition, coreference resolution, dialog act segmentation, etc.
- human listeners may recover recognition errors using context information and prosodic cues
- content words with new information status are typically pitch accented, e.g. "List FLIGHTS from DALLAS to HOUSTON"
- a previous study has shown that words with automatically predicted pitch accents account for 90% of the slots in a subset of ATIS (Stehwien & Vu, 2016)

Bidirectional Recurrent Neural Network with Ranking Loss (Vu et al., 2015)
- bi-directionality: a combination of forward and backward hidden layers models past and future context
- the ranking loss function maximizes the distance between the true label and the best competing target (sketched below)
- 100-dimensional word embeddings
- 95.56% F1-score on ATIS
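The slides do not spell out the loss function; a minimal numpy sketch of a ranking loss of this kind (following the dos Santos-style formulation such models typically use; the margin and scaling values here are illustrative assumptions):

```python
import numpy as np

def ranking_loss(scores, gold, gamma=2.0, m_pos=2.5, m_neg=0.5):
    """Push the score of the true label above a positive margin and the
    score of the best competing label below a negative margin.
    scores: 1-D array of per-label scores; gold: index of the true label."""
    s_pos = scores[gold]                    # score of the true label
    s_neg = np.delete(scores, gold).max()   # score of the best competing label
    return (np.log1p(np.exp(gamma * (m_pos - s_pos)))
            + np.log1p(np.exp(gamma * (m_neg + s_neg))))
```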

Bidirectional Sequential Convolutional Neural Network (Vu, 2016)
- combination of two CNNs that model past and future contexts respectively
- additional surrounding context gives the current word more weight
- 50-dimensional word embeddings
- 95.61% F1-score

Word Embeddings with Pitch Accent Extensions
- word embeddings are vector representations of words based on their lexical and semantic context
- the word embedding of w is concatenated with a binary flag indicating the absence or presence of a pitch accent on w (see the sketch below):

    embs(w) = [lexical_embs(w), pitch_accent_flag(w)]    (1)

- combines acoustic-prosodic information and lexico-semantic word embeddings
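Equation (1) is a plain vector concatenation; a minimal sketch (function and variable names are my own, not the authors' code):

```python
import numpy as np

def extend_embedding(lexical_emb, has_pitch_accent):
    """Append a binary pitch-accent flag to a lexical word embedding, as in Eq. (1)."""
    flag = np.array([1.0 if has_pitch_accent else 0.0])
    return np.concatenate([lexical_emb, flag])

# e.g. the 100-dim RNN embeddings become 101-dim
extended = extend_embedding(np.random.randn(100), has_pitch_accent=True)
assert extended.shape == (101,)
```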

Method
- recognize the ATIS corpus from the audio signal with ASR (7% WER)
- obtain word, syllable, and phone alignments
- a pitch accent detector determines the binary label for each word
- the word embeddings are trained and concatenated with the binary pitch accent flag
- compare slot filling performance on the original transcriptions and the recognized version

Pitch Accents in ATIS
- analyze the co-occurrence of (predicted) pitch accents and slots in ATIS (computation sketched below)
- compare on manual transcriptions and the recognized test set
- almost 93% of slots are pitch accented in both versions

                            manual   recognized
  # words                     9551         9629
  # slots                     3663         3560
  pred. accents on slots     64.1%        64.0%
  slots with pred. accent    92.7%        92.9%
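A sketch of how such co-occurrence figures can be computed, assuming each word comes with its slot label and predicted accent flag (the triple format and the exact stat definitions are my assumptions, not the authors' data layout):

```python
def accent_slot_stats(tokens):
    """tokens: list of (word, slot_label, has_accent) triples; 'O' marks non-slots."""
    slots = [t for t in tokens if t[1] != "O"]
    accented = [t for t in tokens if t[2]]
    return {
        "# words": len(tokens),
        "# slots": len(slots),
        # fraction of predicted accents that fall on slot words
        "pred. accents on slots": sum(t[1] != "O" for t in accented) / max(len(accented), 1),
        # fraction of slot words that carry a predicted accent
        "slots with pred. accent": sum(t[2] for t in slots) / max(len(slots), 1),
    }
```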

Pitch Accents in Neural Models: Results
- results on ASR output are much worse than on manual transcriptions
- pitch accent extensions do not help on the original text: context information suffices
- pitch accent extensions slightly improve the F1-score on ASR output

                                               RNN     CNN
  Transcriptions (lexical word embeddings)    94.97   95.25
  + pitch accent extensions                   94.98   95.25
  ASR output (lexical word embeddings)        89.55   89.13
  + pitch accent extensions                   90.04   89.57

Analysis
- unknown tokens replace words in the benchmark dataset that occur only once (sketched below)
- the ASR system also produces more unknown tokens due to recognition errors
- analysis of RNN results on unknown tokens, independent of slot type:
    baseline: 43% correct
    with pitch accent extensions: 51% correct
- this indicates that pitch accent information helped to localize a slot, even though the actual label may be incorrect
- unknown tokens may still carry helpful information that is captured by this method
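A minimal sketch of the singleton-to-unknown replacement described in the first bullet (a standard preprocessing step; the token name follows the examples on the next slide, the code itself is not the authors'):

```python
from collections import Counter

def replace_singletons(sentences, unk="<UNK>"):
    """Replace words occurring only once in the corpus with an unknown token.
    sentences: list of word lists."""
    counts = Counter(w for sent in sentences for w in sent)
    return [[w if counts[w] > 1 else unk for w in sent] for sent in sentences]
```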

Examples

  reference:     I NEED THE FLIGHTS FROM WASHINGTON TO MONTREAL ON A SATURDAY
  recognized:    I NEED THE FLIGHTS FROM <UNK> TO MONTREAL ON SATURDAY
  ref. slots:    O O O O O B-fromloc.city_name O B-toloc.city_name O B-depart_date.day_name
  with accents:  O O O O O B-fromloc.city_name O B-toloc.city_name O B-depart_date.day_name
  baseline:      O O O O O O O B-toloc.city_name O B-depart_date.day_name

  -> the unknown token is labelled correctly

  reference:     WHICH AIRLINES FLY BETWEEN TORONTO AND SAN DIEGO
  recognized:    WHICH AIRLINES FLY BETWEEN TO ROUND <UNK> AND SAN DIEGO
  ref. slots:    O O O O O O O O B-toloc.city_name I-toloc.city_name
  with accents:  O O O O O O O O B-toloc.city_name I-toloc.city_name
  baseline:      O O O O B-fromloc.city_name B-round_trip I-round_trip O B-toloc.city_name ...

  -> misrecognized words are labelled more appropriately

Conclusion
- we addressed the performance drop that state-of-the-art slot filling methods suffer on speech recognition output
- extended word embedding vectors with pitch accent features
- small but positive effects were obtained with two models (RNN and CNN)
- the limited and closed-domain nature of ATIS may account for the small differences
- evidence that pitch accent features may help in the case of misrecognized or unknown words

References
- N. T. Vu et al. (2015): Bi-directional Recurrent Neural Network with Ranking Loss for Spoken Language Understanding. IEEE Transactions on Audio, Speech and Language Processing.
- N. T. Vu (2016): Sequential Convolutional Neural Networks for Slot Filling in Spoken Language Understanding. Proceedings of Interspeech.
- G. Mesnil et al. (2015): Using Recurrent Neural Networks for Slot Filling in Spoken Language Understanding. IEEE Transactions on Audio, Speech and Language Processing.
- S. Stehwien and N. T. Vu (2016): Exploring the Correlation of Pitch Accents and Semantic Slots for Spoken Language Understanding. Proceedings of Interspeech.
- A. Schweitzer (2010): Production and Perception of Prosodic Events - Evidence from Corpus-based Experiments. Ph.D. thesis, Universität Stuttgart.