Speech Recognition Combining MFCCs and Image Features


S. Karlos (Department of Mathematics)
N. Fazakis (Department of Electrical and Computer Engineering)
K. Karanikola (Department of Mathematics)
S. Kotsiantis (Department of Mathematics)
K. Sgarbas (Department of Electrical and Computer Engineering)
University of Patras, Greece

Aim
- Combination of audio signal and image features
- Exploitation of larger frames for speech signals
- Increase of classification accuracy without using complex algorithms

Contents
- Speaker identification problem
- Attributes of speech signals
- Examination of Content-Based Image Retrieval (CBIR) features
- Combination of MFCCs + CBIR
- Experiments
- Conclusion

Speaker Identification Problem
- Determines the speaker from a set of registered speakers
  - This is called closed-set identification
  - The result is the best-matching speaker
- What if the speaker is not in the database?
  - This is called open-set identification
  - The result can be a speaker or a no-match result
- Our experiment is a closed-set identification problem
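The difference between the two settings can be sketched in a few lines. This is only an illustration: the `identify` function, the score dictionary and the threshold value are hypothetical and do not come from the slides, which deal with the closed-set case only.

```python
def identify(scores, threshold=None):
    """Return the best-matching speaker, or None for an open-set no-match.

    scores: dict mapping registered speaker name -> similarity score (higher is better).
    threshold: None for closed-set identification; a minimum score for open-set.
    """
    best = max(scores, key=scores.get)
    if threshold is not None and scores[best] < threshold:
        return None  # open set: no registered speaker is close enough
    return best


# Made-up similarity scores for three registered speakers (assumption).
scores = {"speaker_a": 0.42, "speaker_b": 0.61, "speaker_c": 0.17}
closed_set_result = identify(scores)               # always returns a speaker
open_set_result = identify(scores, threshold=0.8)  # may return None
```

In the closed-set case the best match is always returned, however weak it is; the open-set case needs an extra rejection rule such as the threshold assumed here.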

Extraction of audio characteristics
- Different representations of speech signals:
  1. Mel-Frequency Cepstral Coefficients (MFCCs)
  2. Linear Predictive Codes (LPCs)
  3. Perceptual Linear Prediction (PLP)
  4. PLP-Relative Spectra (PLP-RASTA)
- Non-linear behavior of speech
- Need for adapting the signal to the human ear scale
- Most efficient solution: MFCC features

Extraction of image characteristics
- Spectrogram: time-frequency representation of an audio signal
- Short-Time Fourier Transform (STFT)
- Different approaches to image processing:
  1. Content-based
  2. Feature-based
  3. Appearance-based
- Determine the similarity through distances between feature vectors
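A minimal sketch of how a spectrogram image could be produced from an audio signal with the STFT. It uses a naive DFT from the standard library so that it is self-contained (a real system would use an FFT). The function names, the 64-sample frame, the hop size and the 80 dB dynamic range are assumptions for illustration, not values from the slides.

```python
import cmath
import math


def stft_magnitudes(signal, frame_len, hop):
    """Return one list of DFT magnitudes (bins 0..frame_len/2) per frame."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * i / (frame_len - 1))
              for i in range(frame_len)]  # Hamming window
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        seg = [signal[start + i] * window[i] for i in range(frame_len)]
        frames.append([
            abs(sum(x * cmath.exp(-2j * math.pi * k * i / frame_len)
                    for i, x in enumerate(seg)))
            for k in range(frame_len // 2 + 1)
        ])
    return frames


def to_grayscale(frames, floor_db=-80.0):
    """Map log-magnitudes to 0..255 pixels; rows are frequency bins, columns are frames."""
    peak = max(max(f) for f in frames) or 1.0
    image = []
    for k in range(len(frames[0])):
        row = []
        for f in frames:
            db = max(20.0 * math.log10(max(f[k] / peak, 1e-12)), floor_db)
            row.append(int(round(255 * (db - floor_db) / -floor_db)))
        image.append(row)
    return image


# A 1 kHz test tone sampled at 8 kHz (illustrative input, not corpus data).
tone = [math.sin(2 * math.pi * 1000 * t / 8000) for t in range(256)]
image = to_grayscale(stft_magnitudes(tone, frame_len=64, hop=32))
```

The resulting 2D array can then be treated like any other grayscale image by an image feature extractor.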

Related works
- Content-Based Image Retrieval (CBIR) techniques have been widely used
- Exploitation of color content and texture information
- Best-known approaches:
  1. Local gradient features along with PCA + HMMs
  2. Delta MFCCs
  3. 2D Gabor features + MLP
  4. Feature-Finding Neural Network (FFNN)
  5. Wavelet packet transform + MKL
  6. RANSAC algorithm

Proposed Technique: 1st view
- Acquire the first 25 MFCC coefficients (the 0th is discarded)
- A Hamming window is used
- Each frame lasts 0.5 seconds
- Overlap factor of 50%
- Highest band edge of the Mel filters: 4 kHz
- 40 warped spectral bands
- Logarithmic scale of the magnitude spectrum
- Discrete Cosine Transform (DCT)
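The steps listed for the first view can be sketched as follows. This is a simplified standard-library illustration under stated assumptions, not the authors' implementation: the mel formula (2595 log10(1 + f/700)), the triangular filter shape, the lower band edge of 0 Hz, the unnormalized DCT-II and the naive DFT are all choices made here. The defaults mirror the slide values (0.5 s frames, 50% overlap, 40 bands, 4 kHz upper edge, coefficients 1 to 25).

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_bands, n_fft, sample_rate, f_max):
    """Triangular filters spaced evenly on the mel scale from 0 Hz to f_max."""
    mels = [hz_to_mel(f_max) * i / (n_bands + 1) for i in range(n_bands + 2)]
    edges = [mel_to_hz(m) * n_fft / sample_rate for m in mels]  # fractional bins
    bank = []
    for b in range(1, n_bands + 1):
        lo, mid, hi = edges[b - 1], edges[b], edges[b + 1]
        bank.append([
            (k - lo) / (mid - lo) if lo < k <= mid
            else (hi - k) / (hi - mid) if mid < k < hi
            else 0.0
            for k in range(n_fft // 2 + 1)
        ])
    return bank


def mfcc_frame(frame, bank, n_coeffs):
    n = len(frame)
    # Hamming window, then magnitude spectrum via a naive DFT (slow, illustration only).
    win = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
           for i, x in enumerate(frame)]
    mag = [abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(win)))
           for k in range(n // 2 + 1)]
    # Log of the mel-warped band energies; the small constant avoids log(0).
    log_e = [math.log(sum(w * m for w, m in zip(row, mag)) + 1e-10) for row in bank]
    nb = len(log_e)
    # DCT-II; coefficient 0 is skipped, coefficients 1..n_coeffs are kept.
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / nb) for j, e in enumerate(log_e))
            for c in range(1, n_coeffs + 1)]


def mfcc(signal, sample_rate, frame_sec=0.5, overlap=0.5,
         n_bands=40, f_max=4000.0, n_coeffs=25):
    frame_len = int(round(frame_sec * sample_rate))
    hop = max(1, int(round(frame_len * (1.0 - overlap))))
    bank = mel_filterbank(n_bands, frame_len, sample_rate, f_max)
    return [mfcc_frame(signal[s:s + frame_len], bank, n_coeffs)
            for s in range(0, len(signal) - frame_len + 1, hop)]
```

Each returned row is the 25-dimensional MFCC vector of one frame. With the naive DFT a 0.5-second frame is slow to process, so a short `frame_sec` is advisable when trying the sketch out.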

Proposed Technique: 2nd view
- Use of AutoColorCorrelogramFilter (autocor)

$$\alpha_c^{(k)}(I) = \gamma_{c,c}^{(k)}(I), \qquad \gamma_{c_i,c_j}^{(k)}(I) = \Pr_{p_1 \in I_{c_i},\; p_2 \in I}\left[\, p_2 \in I_{c_j} \;\middle|\; \operatorname{dist}(p_1, p_2) = k \,\right]$$

- Spatial correlation of colors from each image is distilled
- Not based on purely local properties
- Effective in recognizing large changes of shape
- Efficiently computed
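A brute-force sketch of the auto color correlogram may make the definition concrete: for each color c and distance k, it estimates the probability that a pixel at distance k from a pixel of color c also has color c. The use of the chessboard (L-infinity) distance, the pre-quantized color indices and the function name are assumptions for this illustration. The filter used in the slides relies on the LIRe implementation, whose details are not given here.

```python
def auto_correlogram(image, n_colors, distances):
    """image: 2D list of quantized color indices in range(n_colors).

    Returns {k: [alpha_c^(k) for each color c]}, using the chessboard distance.
    """
    h, w = len(image), len(image[0])
    result = {}
    for k in distances:
        same = [0] * n_colors   # neighbors at distance k sharing the pixel's color
        total = [0] * n_colors  # all in-image neighbors at distance k
        for y in range(h):
            for x in range(w):
                c = image[y][x]
                for dy in range(-k, k + 1):
                    for dx in range(-k, k + 1):
                        if max(abs(dy), abs(dx)) != k:
                            continue  # keep only the ring at distance exactly k
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w:
                            total[c] += 1
                            if image[ny][nx] == c:
                                same[c] += 1
        result[k] = [s / t if t else 0.0 for s, t in zip(same, total)]
    return result
```

This direct version visits every pixel pair on each ring and is meant for reading, not for speed.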

MFCCs + autocor + SVM

Proposed Technique: Learning stage
- Support Vector Machines (SVMs)
- Hyperplanes that separate two classes
- Maximizing the margin to reduce the generalization error
- Can deal with very high-dimensional data
- Efficient implementation through the LibSVM library
- Use of a polynomial kernel (degree = 3)
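For reference, LibSVM's polynomial kernel has the form (gamma * u·v + coef0)^degree. The sketch below evaluates that kernel and a two-class decision value in plain Python. The support vectors, coefficients, bias and the gamma and coef0 values are invented for illustration, since the slides only state the degree.

```python
def poly_kernel(u, v, gamma, coef0=0.0, degree=3):
    """Polynomial kernel in the LibSVM form: (gamma * <u, v> + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(u, v))
    return (gamma * dot + coef0) ** degree


def decision_value(x, support_vectors, dual_coefs, bias, gamma, coef0=0.0, degree=3):
    """Signed score of a two-class SVM; the sign selects the predicted class."""
    return sum(c * poly_kernel(sv, x, gamma, coef0, degree)
               for sv, c in zip(support_vectors, dual_coefs)) + bias


# Hypothetical two-support-vector model (values are assumptions, not trained).
svs = [[1.0, 0.0], [-1.0, 0.0]]
coefs = [1.0, -1.0]
score = decision_value([2.0, 0.0], svs, coefs, bias=0.0, gamma=1.0)
```

A multi-class problem such as speaker identification is built from several such two-class machines; LibSVM uses a one-against-one scheme for this.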

Data
- CHAINS Corpus
- Selected mode: solo speech
- 36 speakers (28 from Eastern Ireland, 8 from the UK and USA)
- 19 different sentences out of the 33
- 3 scenarios: 8, 16 and 36 speakers
- Equal numbers of male and female speakers in each scenario

Experimental procedure
- Comparison with 9 other image filters
- Supervised classifiers:
  1. SVMs
  2. Multi-Layer Perceptron (MLP)
  3. Logistic Regression (LogReg)
- 10-fold cross-validation
- The WEKA tool was used along with the Lucene Image Retrieval (LIRe) libraries
- Computational time was recorded (Intel i3 64-bit system, 8 GB RAM)
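As a reminder of what 10-fold cross-validation does, the following sketch splits instance indices into ten folds and yields one train/test pair per fold. It is a plain, unstratified split with an assumed random seed. The slides do not say how WEKA's folds were configured, so this shows the general scheme only.

```python
import random


def k_fold_indices(n_instances, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    idx = list(range(n_instances))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # fold sizes differ by at most one
    for i in range(k):
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, folds[i]
```

Every instance is used for testing exactly once, and the reported accuracy is the average over the ten test folds.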

Experimental procedure

CBIR filter   Initial number of features   Useful number of features
autocor       1024                         57
binpyr        756                          131
clay          33                           33
edhist        80                           80
fcth          192                          18
fuzzy         576                          17
gabor         60                           60
jpeg          192                          192
phog          630                          44
simplehist    64                           11

- Reduction of dimensionality: remove useless attributes
- The number of instances in each dataset has been reduced dramatically:
  - 8 speakers: about 32,000 -> 1,298
  - 16 speakers: about 65,000 -> 2,577
  - 36 speakers: about 146,000 -> 5,818
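The slides do not specify how useless attributes were detected. One common criterion, assumed here, is to drop features whose value never changes across the dataset, since they cannot help separate speakers. The function name and the toy rows below are hypothetical.

```python
def drop_constant_columns(rows):
    """Remove features that take the same value in every instance.

    rows: list of equal-length feature vectors.
    Returns (reduced_rows, kept_column_indices).
    """
    keep = [j for j in range(len(rows[0]))
            if any(r[j] != rows[0][j] for r in rows)]
    return [[r[j] for j in keep] for r in rows], keep


# Toy feature matrix: columns 0 and 2 are constant (made-up values).
rows = [[0.0, 1.5, 7.0],
        [0.0, 2.5, 7.0],
        [0.0, 0.5, 7.0]]
reduced, kept = drop_constant_columns(rows)
```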

Results

              8 speakers                16 speakers               36 speakers
Classifier    MFCCs    MFCCs + autocor  MFCCs    MFCCs + autocor  MFCCs    MFCCs + autocor
SVM           79.89    87.44            75.90    83.70            66.74    76.64
Time (sec)    0.45     0.88             1.29     2.09             5.93     9.62
MLP           69.49    82.42            69.03    80.36            60.1581  66.33
Time (sec)    10.71    60.80            35.43    121.04           179.89   452.50
LogReg        66.41    76.96            73.38    79.74            60.89    67.13
Time (sec)    0.26     1.08             1.71     4.06             5.46     27.98

Statistical comparison
- Nemenyi post-hoc test
- The length of the critical difference (CD) bar shows the rank distance needed for a significant difference
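The critical difference of the Nemenyi test is usually computed as CD = q_alpha * sqrt(k (k + 1) / (6 N)), where k is the number of compared methods, N the number of datasets and q_alpha a tabulated critical value. The sketch below evaluates that formula. The numbers in the example call are placeholders, not the values behind the slide's figure.

```python
import math


def nemenyi_cd(q_alpha, n_methods, n_datasets):
    """Critical difference in average rank for the Nemenyi post-hoc test."""
    return q_alpha * math.sqrt(n_methods * (n_methods + 1) / (6.0 * n_datasets))


# Placeholder inputs: 3 methods, 10 datasets, q_0.05 = 2.343 for k = 3.
cd = nemenyi_cd(2.343, n_methods=3, n_datasets=10)
```

Two methods are considered significantly different when their average ranks differ by more than CD.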

Experiments
- A boost in accuracy was recorded for all the tested scenarios
- 11.5%, 7.8% and 9.9% improvement compared with standalone MFCCs
- Building the classification model takes a few seconds
- Fuzzy filtering techniques showed fluctuating performance
- MFCCs+autocor and MFCCs+binpyr achieved the best results
- The proposed technique requires far fewer computational resources

Conclusions
- Tackle Automatic Speech Recognition (ASR) tasks
- Enlarge the feature vector of audio signals
- Reduce the training time
- Methods based on local features performed poorly
- Improved generalization behavior for most SI filters

Promising points
- Extract more specialized features under the MFCCs + SI features scheme
- Parallel implementation
- Apply multi-view semi-supervised techniques
- Combination of magnitude with phase-related features (Hartley Phase Spectrum)

References
- M. Lux and S. A. Chatzichristofis, "Lire: Lucene image retrieval," Proceedings of the 16th ACM Int. Conf. Multimed. - MM '08, p. 1085, 2008.
- F. Cummins, M. Grimaldi, T. Leonard, and J. Simko, "The CHAINS Speech Corpus: CHAracterizing INdividual Speakers," Proc. SPECOM, pp. 1-6, 2006.
- J. Dennis, H. D. Tran, and H. Li, "Spectrogram Image Feature for Sound Event Classification in Mismatched Conditions," IEEE Signal Process. Lett., vol. 18, no. 2, pp. 130-133, Feb. 2011.
- M. Mayo, "ImageFilter WEKA filter that uses LIRE to extract image features," 2015. [Online]. Available: https://github.com/mmayo888/imagefilter
- I. Paraskevas and M. Rangoussi, "The Hartley phase spectrum as an assistive feature for classification," Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 5933 LNAI, pp. 51-59, 2010.