Comparison Parameters and Speaker Similarity Coincidence Criteria:

The Easy Voice system uses two interrelated comparison parameters, corresponding to the first and second error types. False Rejection (FR) is the probability of error when two voices being compared are considered different; the larger the FR value, the higher the probability that the two voices belong to the same speaker. False Acceptance (FA) is the probability of error when two voice samples being compared are considered to be from the same speaker; the smaller the FA value, the higher the probability that the two samples are from the same speaker. When two voice samples are compared (one a known voice, the other an unknown voice), the FR and FA values are calculated.

The Likelihood Ratio (LR) is a measure of similarity between the two voices being compared: it is calculated as the ratio of the FR and FA probabilities. The larger the ratio, the higher the probability that the two voice samples are from the same speaker. From these three values the system then calculates a generalized similarity value for the biometric characteristics of the target and source speakers, ranging from 0 to 100; the larger this value, the higher the probability that the biometric characteristics come from the same voice and speaker. The algorithms produce statistical numerical data from which the program derives a false rejection number, a false acceptance number, and a likelihood ratio number, yielding an unbiased result that the voices are, or are not, the same.
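As a rough illustration of this arithmetic, the Python sketch below forms the likelihood ratio from FR and FA and maps it onto a 0-100 scale. The function names, the cap, and the linear mapping are assumptions for illustration only, not the Easy Voice implementation.

```python
# Minimal sketch of the FR/FA/LR arithmetic described above.
# The cap and the linear 0-100 mapping are illustrative assumptions;
# the document only states that the generalized value ranges 0-100.

def likelihood_ratio(fr: float, fa: float) -> float:
    """LR as the ratio of the FR and FA probabilities.

    fr: probability of wrongly calling a same-speaker pair different.
    fa: probability of wrongly calling a different-speaker pair the same.
    A larger ratio suggests the two samples share a speaker.
    """
    if fa <= 0.0:
        raise ValueError("FA must be positive to form a ratio")
    return fr / fa


def similarity_score(fr: float, fa: float, lr_cap: float = 1000.0) -> float:
    """Map the likelihood ratio onto a hypothetical 0-100 similarity scale."""
    lr = min(likelihood_ratio(fr, fa), lr_cap)
    return 100.0 * lr / lr_cap


# Example: FR = 0.40, FA = 0.002 gives LR = 200 and a score of 20.0.
print(similarity_score(0.40, 0.002))
```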

The Easy Voice Biometric Analysis Algorithms: The original analog spectrograph, generally referred to as the Voice Identification V700 or the Kay Sonograph, was the tool of the day for performing the analysis. The technology of the time limited the analysis to visual spectrographic comparison of the formants, since numerical information on pitch, rate, and other factors was not accurately available from those machines. Other equipment could be used to ascertain that information, but listening and formant comparison were the main features on which a conclusion or opinion was formulated. In the 1980s, Kay Elemetrics, with help from the Russian scientists at Speech Technology Center, produced a piece of software called Multispeech. This program replaced the analog machines and technology with a digital version of speech analysis, enabling the examiner to conduct the same aural-spectrographic method with the addition of numerical results for pitch, gaps, formant tracking, and so on. In essence, it was the start of the modern-day voice biometric system: the examiner was able to identify or eliminate a speaker and verify that analysis. As a result of the Multispeech and IKAR Labs (Speech Technology Center) voice analysis programs, the manual method of voice identification requiring a verbatim exemplar has been superseded. Although it is advised to obtain a verbatim exemplar whenever possible, it is not always practical or possible. With the proper software techniques, a numerical voice model of a subject's speech can be obtained using the original aural and visual cues listed above. This information is fed as a model into a database, and a detailed analysis can be conducted within the computer environment to discriminate between one voice model and another. The experience of the last 50 years in speech technology has taught us that the human voice is one of the best biometric descriptors that can be utilized.

By using model criteria such as:

- Resampling of the signal to an 11025 Hz sample rate
- Determining the recording channel type
- Extraction of the speech signal (cleansing of background noise and pauses)
- Calculation of the clear speech duration
- Calculation of the signal-to-noise ratio
- Calculation of the reverberation level
- Transformation of the signal files into RIFF WAV (.wav) format

the ability to take these models and compare them in an apples-to-apples analysis goes a long way toward ensuring scientifically reliable, accurate results. A sketch of these preparation steps appears after the method list below.

Voice Model Calculation: In the current version of Easy Voice (Voice Grid), three methods of biometric feature extraction and pairwise comparison are implemented:

- Spectral-Formant method.
- Pitch Statistics Analysis.
- A Gaussian Mixture Model (GMM) based method with a Support Vector Machine (SVM) classifier.
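Before any of the three methods are applied, recordings are normalized per the criteria listed above. The following Python sketch, assuming librosa and soundfile are available, walks through several of those steps (resampling to 11025 Hz, pause removal, clear-speech duration, and a naive SNR estimate); the -40 dB energy threshold and the gap-based SNR method are illustrative assumptions, not Easy Voice internals.

```python
# Rough sketch of the model-preparation criteria listed above.
# The -40 dB split threshold and the gap-based SNR estimate are
# illustrative assumptions, not the Easy Voice implementation.
import librosa
import numpy as np
import soundfile as sf

TARGET_SR = 11025  # sample rate named in the criteria list


def prepare_model_signal(path: str, out_path: str) -> dict:
    # Resample to 11025 Hz, mixed down to mono.
    y, _ = librosa.load(path, sr=TARGET_SR, mono=True)

    # Crude pause removal: keep regions above a -40 dB energy threshold.
    intervals = librosa.effects.split(y, top_db=40)
    speech = np.concatenate([y[s:e] for s, e in intervals]) if len(intervals) else y

    # Clear speech duration in seconds.
    duration_s = len(speech) / TARGET_SR

    # Naive SNR: speech energy versus energy of the removed gaps.
    mask = np.zeros(len(y), dtype=bool)
    for s, e in intervals:
        mask[s:e] = True
    noise = y[~mask]
    noise_power = float(np.mean(noise**2)) if noise.size else 1e-12
    snr_db = 10.0 * np.log10((float(np.mean(speech**2)) + 1e-12) / noise_power)

    # Store the cleaned signal as a RIFF WAV (.wav) file.
    sf.write(out_path, speech, TARGET_SR)
    return {"clear_speech_duration_s": duration_s, "snr_db": snr_db}
```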

Spectral-Formant Method. The spectral maxima of the speech signal are called formants. They are formed by the resonances that occur in the vocal tract during speech production. The formants (resonance frequencies) depend on the geometrical size and shape of the vocal tract (the head with all its cavities and organs). In general, within the frequency band of a phone line (300-3400 Hz) only four formants can be found. The instantaneous values and dynamic traces of those four formants are extracted from the dynamic spectrogram and compared using a Support Vector Machine (SVM) classifier. [Figure: formant traces of the same phrase pronounced by different speakers differ visibly.] The Spectral-Formant Method provides highly reliable identification results and has the following advantages (a sketch of per-frame formant estimation follows this list):

- Requires as little as 16 seconds of speech sample.
- Channel independent (channel features do not affect the speaker's model).
- Text and language independent, and robust to changes in emotional state.
- High noise immunity (signal-to-noise ratio as low as 10-12 dB); noisy signals can be analyzed.
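As an illustration of the idea, here is a textbook LPC-based formant estimate for a single analysis frame in Python (using librosa and numpy). The document does not say how Easy Voice extracts its formant traces, so the LPC order and band limits below are assumptions.

```python
# Textbook LPC-based formant estimation for one analysis frame.
# Illustrative only; the document does not specify Easy Voice's method.
import numpy as np
import librosa


def frame_formants(frame: np.ndarray, sr: int, order: int = 10, n_formants: int = 4):
    """Return up to the first four formant frequencies (Hz) in a frame."""
    a = librosa.lpc(frame.astype(float), order=order)  # LPC polynomial coefficients
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]             # one root per conjugate pair
    freqs = np.angle(roots) * sr / (2.0 * np.pi)  # pole angle -> frequency in Hz
    # Keep only formants inside the phone-line band cited above.
    freqs = np.sort(freqs[(freqs > 300.0) & (freqs < 3400.0)])
    return freqs[:n_formants]
```

Applying this frame by frame yields the kind of dynamic formant traces that a classifier such as an SVM can then compare.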

Pitch Statistics Analysis Method. Pitch is the fundamental frequency of the voice: the frequency of vocal-fold oscillation. We control and change this frequency (tone) depending on emotion and stress, which is why a direct comparison of pitch values is not possible even for the same text. Instead, the statistics of pitch are measured and compared. In the current version of Easy Voice, 16 pitch parameters are analyzed. [Figure: pitch statistics of the same phrase pronounced by different speakers.] The Pitch Statistics Analysis Method is an auxiliary method. It is less reliable because it depends on the emotional state of the speaker (up to 16% erroneous results are possible) and takes longer to extract voice parameters and calculate speaker models. Still, it has the following advantages (a sketch of pitch-statistics extraction follows this list):

- Requires a minimum of only 10 seconds of speech sample (even less than the Spectral-Formant method).
- Noisy signals can be used.
- Channel independent.
- Small model size (about 1 KB per sound file).
- High speed of speaker search.
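Since the 16 pitch parameters are not enumerated in the document, the sketch below (using librosa's pYIN pitch tracker) computes a handful of common pitch statistics purely as stand-ins.

```python
# Hedged sketch of pitch-statistics extraction with librosa's pYIN.
# The document says 16 pitch parameters are compared but does not list
# them; the statistics below are common stand-ins, not Easy Voice's set.
import numpy as np
import librosa


def pitch_statistics(y: np.ndarray, sr: int) -> dict:
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
    )
    f0 = f0[voiced_flag & np.isfinite(f0)]  # voiced, defined frames only
    return {
        "mean_hz": float(np.mean(f0)),
        "median_hz": float(np.median(f0)),
        "std_hz": float(np.std(f0)),
        "min_hz": float(np.min(f0)),
        "max_hz": float(np.max(f0)),
        "range_hz": float(np.max(f0) - np.min(f0)),
    }
```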

Gaussian Mixture Model Based Method (GMM). This approach is more statistical and requires computing power, so it cannot be performed manually. In simple terms, not only the spectral maxima (the values of the resonance frequencies) are measured and compared, but also their shape and the distribution of energy along the frequencies. Put another way, it describes the intra-speaker variability in one recording and compares it with the intra-speaker variability in the second file.
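As a sketch of the general idea, one might fit a GMM over MFCC features and score the unknown recording against it, as below with scikit-learn and librosa. This shows a plain GMM likelihood comparison rather than the GMM-SVM combination the document names, and the component count and feature choice are assumptions.

```python
# Sketch: GMM speaker model over MFCC features, capturing spectral
# shape and energy distribution. Component count and features are
# illustrative assumptions, not Easy Voice's configuration.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture


def train_speaker_gmm(y: np.ndarray, sr: int, n_components: int = 32) -> GaussianMixture:
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).T  # frames x coefficients
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    return gmm.fit(mfcc)


def compare(gmm_known: GaussianMixture, y_unknown: np.ndarray, sr: int) -> float:
    """Average log-likelihood of the unknown recording under the known model."""
    mfcc = librosa.feature.mfcc(y=y_unknown, sr=sr, n_mfcc=20).T
    return float(gmm_known.score(mfcc))
```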

The Gaussian Mixture Model based method is the main comparison method. It places higher demands on signal quality and duration, but it is the most precise approach. The highlights of the method are:

- High speed of speaker search.
- Ideal for clean recordings with low noise levels.
- Ideal for long recordings.

Results reflected in EVB: The green marker indicates audio files whose FR and FA values passed the strong filter thresholds. The yellow marker indicates audio files whose FR and FA values passed the medium filter thresholds but did not reach the strong thresholds. The grey marker indicates audio files whose FR and FA values passed the soft filter thresholds but did not reach the medium thresholds. The remaining results have no colored marker. (An illustrative mapping follows below.)
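Purely to illustrate the bucketing, the sketch below maps FR/FA values onto the marker colors. Every threshold number here is a placeholder; the document does not give the actual strong, medium, or soft filter threshold values.

```python
# Illustrative mapping of FR/FA results onto the EVB marker colors.
# All threshold values here are placeholders; the actual strong,
# medium, and soft filter thresholds are not stated in the document.
FILTER_THRESHOLDS = {
    "strong": (0.80, 0.01),  # (minimum FR, maximum FA) - hypothetical
    "medium": (0.60, 0.05),
    "soft":   (0.40, 0.10),
}


def marker_color(fr: float, fa: float) -> str:
    for color, level in (("green", "strong"), ("yellow", "medium"), ("grey", "soft")):
        min_fr, max_fa = FILTER_THRESHOLDS[level]
        if fr >= min_fr and fa <= max_fa:
            return color
    return "none"  # result shown without a colored marker
```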
