On Human Capability and Acoustic Cues for Discriminating Singing and Speaking Voices


On Human Capability and Acoustic Cues for Discriminating Singing and Speaking Voices
Yasunori Ohishi (1), Masataka Goto (3), Katunobu Itou (2), Kazuya Takeda (1)
(1) Graduate School of Information Science, Nagoya University, Japan
(2) Faculty of Computer and Information Sciences, Hosei University, Japan
(3) National Institute of Advanced Industrial Science and Technology

Let's do the quiz: can you discriminate between singing and speaking voices? (Japanese voices)
Q.1. Can you do it? (2 s long)
Q.2. Can you do it? (500 ms long)
Q.3. Can you do it? (200 ms long)

Investigation of the signal length necessary for discrimination
[Plot: correct rate [%] vs. signal length [ms] (200-2000 ms) for 1-s, 500-ms, and 200-ms voice signals; curves for singing performance, speaking performance, and total performance]

Investigation of the signal length necessary for discrimination (continued)
[Same plot, annotated] Not only temporal characteristics but also such short-term features carry discriminative cues.

The goal of this study
- Subjective experiments: investigation of the acoustic cues necessary for discriminating between singing and speaking voices
- Automatic vocal style discriminator, based on the knowledge obtained in the subjective experiments:
  - Spectral feature measure
  - F0 derivative measure

Introduction of the voice database: AIST humming database
- 75 Japanese subjects (37 males, 38 females)
- Each sang the chorus and verse A sections of 25 Japanese songs (selected from the RWC Music Database: Popular Music) at an arbitrary tempo, without musical accompaniment
- Each also read the lyrics of the chorus and verse A sections
- Most of the subjects had no special musical training

The goal of this study
- Subjective experiments: investigation of the acoustic cues necessary for discriminating between singing and speaking voices
- Automatic vocal style discriminator, based on the knowledge obtained in the subjective experiments:
  - Short-term spectral feature measure
  - F0 derivative measure

Investigation of the acoustic cues necessary for discrimination
To compare the importance of temporal and spectral cues for discrimination, voice quality and prosody are modified using signal processing techniques.
- Random splicing technique: the temporal structure of the signal is modified while short-time spectral features are maintained, by cutting the signal (1 s) into pieces (250 ms) and concatenating the pieces in random order.
Let's do the quiz: Q.1 (250 ms), Q.2 (200 ms), Q.3 (125 ms)

Investigation of the acoustic cues necessary for discrimination
To compare the importance of temporal and spectral cues for discrimination, voice quality and prosody are modified using signal processing techniques.
- Low-pass filtering technique: the temporal structure of the signal is maintained while short-time spectral features are modified, by eliminating frequency components above 800 Hz.
Let's do the quiz: Q.1, Q.2, Q.3
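The low-pass manipulation can likewise be sketched with a simple FFT-based filter. Only the 800 Hz cutoff comes from the slides; the filter design below is an illustrative stand-in, not the filter used in the study:

```python
import numpy as np

def lowpass_800hz(signal, sr=16000, cutoff=800.0):
    """FFT-based low-pass filter: zero out all frequency components
    above the cutoff. Prosodic cues such as the F0 contour survive,
    while spectral-envelope detail above 800 Hz is removed."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    spectrum[freqs > cutoff] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))

# A 200 Hz component passes; a 2 kHz component is removed
t = np.arange(16000) / 16000
x = np.sin(2 * np.pi * 200 * t) + np.sin(2 * np.pi * 2000 * t)
y = lowpass_800hz(x)
```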

Investigation of the acoustic cues necessary for discrimination: results

Correct rate [%] by stimulus:

  Stimulus                   Singing voice   Speaking voice
  Original voice                 99.3            100
  Low-pass filtering             86.9             98.9
  Random splicing (250 ms)       84.3             94.9
  Random splicing (200 ms)       76.9             90.0
  Random splicing (125 ms)       70.6             95.0

Discussion: the correct rate for singing voices declined.
- Random splicing technique: the temporal structure of the original voices (rhythm and melody pattern) has been modified, and prolonged vowels of the singing voices have been divided into small pieces (e.g. the syllable sequence "chi ri ba" becomes fragments like "a chi a i bi ri").
- Low-pass filtering technique: frequency components above 800 Hz have been eliminated.
These therefore appear to be important acoustic cues for discrimination.

The goal of this study
- Subjective experiments showed the importance of both short-term spectral features and temporal structure.
- Automatic vocal style discriminator, based on this knowledge:
  - Spectral feature measure
  - F0 derivative measure

Automatic discrimination measures
- Spectral feature measure: captures differences in spectral envelopes and vowel durations, using Mel-Frequency Cepstrum Coefficients (MFCC) and DMFCC (5-frame regression).
- F0 derivative measure: captures differences in the dynamics of prosody, using DF0 (5-frame regression); F0 is extracted with PreFEst (Goto, 1999).
[Figure: spectral envelopes and F0 trajectories of a singing voice vs. a speaking voice]
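The 5-frame regression used for DMFCC and DF0 is the standard delta-coefficient formula computed over a window of N = 2 frames on each side. A minimal sketch (the function name and the edge-padding strategy are my own choices):

```python
import numpy as np

def delta(features, N=2):
    """Delta (regression) coefficients over a (2N+1)-frame window;
    N=2 gives the 5-frame regression used for DMFCC and DF0:
        d[t] = sum_{n=1..N} n * (c[t+n] - c[t-n]) / (2 * sum_{n} n^2)
    `features` has shape (n_frames, dim); edge frames are replicated."""
    T = len(features)
    padded = np.pad(features, ((N, N), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, N + 1))
    d = np.zeros_like(features, dtype=float)
    for n in range(1, N + 1):
        d += n * (padded[N + n:N + n + T] - padded[N - n:N - n + T])
    return d / denom

# For a linearly increasing feature, the interior deltas equal the slope
c = np.arange(10.0).reshape(-1, 1)       # slope of 1 per frame
d = delta(c)
```

The same routine applies to both cases: on a frame-by-frame MFCC matrix it yields DMFCC, and on a one-column F0 contour (in cents) it yields DF0.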

Training the discriminative model
- Gaussian mixture models (16-mixture GMMs), one for singing voices and one for speaking voices.
- Example: discrimination using DF0. For an input signal, F0 is extracted and DF0 is calculated; the likelihood of each frame's DF0 under the singing-voice GMM is compared with that under the speaking-voice GMM to decide "singing" or "speaking".
[Figure: DF0 histograms (relative frequency vs. DF0 [cent/10ms]) modeled by the singing-voice and speaking-voice GMMs]
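The likelihood comparison can be sketched as follows. The GMM parameters here are toy stand-ins invented for illustration (1-D, far fewer than 16 mixtures), not the models trained in the study:

```python
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """Per-frame log-likelihood of 1-D samples x under a Gaussian
    mixture model with the given component parameters."""
    x = np.asarray(x, dtype=float)[:, None]             # (T, 1)
    comp = (np.log(weights)
            - 0.5 * np.log(2 * np.pi * variances)
            - 0.5 * (x - means) ** 2 / variances)       # (T, K) per-component
    return np.logaddexp.reduce(comp, axis=1)            # (T,) log-sum-exp

# Toy stand-ins for the two DF0 models: singing DF0 swings widely
# (note transitions, vibrato), speaking DF0 stays narrow around 0.
sing = dict(weights=np.array([0.5, 0.5]),
            means=np.array([-60.0, 60.0]),
            variances=np.array([900.0, 900.0]))
speak = dict(weights=np.array([1.0]),
             means=np.array([0.0]),
             variances=np.array([400.0]))

df0 = np.array([55.0, -62.0, 48.0, -50.0, 70.0])        # cent/10ms
label = ("singing" if gmm_loglik(df0, **sing).sum()
         > gmm_loglik(df0, **speak).sum() else "speaking")
```

Summing the per-frame log-likelihoods over the whole input and taking the larger total implements the frame-wise likelihood comparison described on the slide.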

Automatic discrimination results
[Plot: correct rate [%] vs. input signal length [ms] (0-2000); total-performance curves for DF0 alone, MFCC+DMFCC, and MFCC+DMFCC+DF0, against a human-performance reference of 87.6%]

Summary and future work
- Investigation of the necessary signal length: not only temporal characteristics but also short-time spectral features can be cues for the discrimination.
- Investigation of the necessary acoustic cues: the relative importance of temporal structure for singing/speaking discrimination was established.
- Automatic vocal style discriminator: with the feature vector MFCC+DMFCC+DF0, the correct rate for 2-s signals is 87.6%.
- Future work: propose new measures to improve the automatic discrimination performance.