Singing voice detection with deep recurrent neural networks


Singing voice detection with deep recurrent neural networks
Simon Leglaive, Romain Hennequin, Roland Badeau

To cite this version: Simon Leglaive, Romain Hennequin, Roland Badeau. Singing voice detection with deep recurrent neural networks. IEEE 40th International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr 2015, Brisbane, Australia. pp. 121-125, 2015. <hal-01110035>

HAL Id: hal-01110035
https://hal.archives-ouvertes.fr/hal-01110035
Submitted on 27 Apr 2015

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

SINGING VOICE DETECTION WITH DEEP RECURRENT NEURAL NETWORKS

Simon Leglaive 1,2, Romain Hennequin 1, Roland Badeau 2
1 Audionamix, 171 quai de Valmy, 75010 Paris, France, <firstname>.<lastname>@audionamix.com
2 Institut Mines-Télécom, Télécom ParisTech, CNRS LTCI, 37-39 rue Dareau, 75014 Paris, France, <firstname>.<lastname>@telecom-paristech.fr

ABSTRACT

In this paper, we propose a new method for singing voice detection based on a Bidirectional Long Short-Term Memory (BLSTM) Recurrent Neural Network (RNN). This classifier is able to take a past and future temporal context into account to decide on the presence or absence of singing voice, thus exploiting the inherently sequential nature of short-term feature extraction in a piece of music. The BLSTM-RNN contains several hidden layers, so it is able to extract from low-level features a simple representation fitted to our task. The results we obtain significantly outperform state-of-the-art methods on a common database.

Index Terms — Singing Voice Detection, Deep Learning, Recurrent Neural Networks, Long Short-Term Memory

1. INTRODUCTION AND PREVIOUS WORK

Localizing the portions of a piece of music that contain singing voice provides information useful for a variety of applications, including vocal melody extraction [1], singing voice separation [2, 3] and singer identification [4]. State-of-the-art methods for singing voice detection are usually based on machine learning techniques. They start by extracting a set of features from a short-term analysis of the audio signal and provide these features as input to a classification system such as Support Vector Machines (SVMs) [3, 5], Hidden Markov Models (HMMs) [2], Random Forests [6, 7] or Artificial Neural Networks (ANNs) [3]. The result of the classifier is then used to estimate the vocal and non-vocal segments of the track, possibly with a final step of temporal smoothing, for instance by means of a median filter [6] or an HMM [5]. One can also add a pre-processing step: in [2], features are computed from a signal whose vocal components have been enhanced by the Harmonic/Percussive Source Separation (HPSS) technique proposed by Ono et al. in [8].

The most widely used features come from the speech processing field. In [3], the authors use a simple combination of MFCCs (Mel-Frequency Cepstral Coefficients), PLPs (Perceptual Linear Predictive coefficients) and LFPCs (Log Frequency Power Coefficients) as a feature set. According to [9], MFCCs and their derivatives are the most appropriate features. Lehner et al. brought to light in [6] the importance of optimizing the parameters of the MFCC computation, that is the filter bank size, the number of MFCCs and the analysis window size; they obtain quite good results using only these features. In [10], Regnier et al. extract characteristics specific to singing voice: vibrato and tremolo.

(Footnote: This work was undertaken while Simon Leglaive was working at Audionamix. Roland Badeau is partly supported by the French National Research Agency (ANR) as a part of the EDISON 3D project (ANR-13-CORD-0008-02).)

In order to improve state-of-the-art results, current singing voice detection techniques usually focus on the feature set. One possible approach is to combine many different simple features. In [5], Ramona et al. consider a very large set of rather low-level features extracted by two signal analyses at different time scales; they keep the most discriminative ones and use an SVM for classification. Another approach is to design high-level features that highlight the information we want to extract. This approach is followed by Lehner et al.
in [7]; the features used in this method allow a considerable reduction of the false-positive rate because they are designed to discriminate singing voice from other confusing, highly harmonic instruments (such as violin, flute or guitar). They use a random forest to decide on the presence of voice for each feature vector.

The approach we present here for singing voice detection is quite different, because we do not focus on elaborating the best set of features. The main point of our work is the use of a deep BLSTM-RNN to detect singing voice. We show that a deep architecture, with several layers of processing, is able to perform well from low-level features. Moreover, unlike combinations of a frame classification model with a temporal smoothing model, which cannot easily be optimized jointly, the recurrent aspect of the network allows the system to take a past and future temporal context into account to classify each input vector.

The paper is organized as follows. Section 2 outlines RNNs and LSTM blocks. In Section 3 we present the features we used and how we built the network. We describe our results in Section 4. Finally, in Section 5 we present our conclusions.

2. RECURRENT NEURAL NETWORKS AND LONG SHORT-TERM MEMORY

2.1. Recurrent Neural Networks

An ANN is an assembly of inter-connected neurons. A neuron computes its output by applying a nonlinear activation function to the weighted sum of its inputs; the weights are estimated during the training procedure. A Multi-Layer Perceptron (MLP) is a feed-forward ANN that maps inputs to outputs by propagating data from the input layer to the output layer, through hidden layers. Adding recurrent connections between neurons makes it possible to handle the sequential aspect of the inputs. Let us denote the sequence of input feature vectors S_x = {x_1, ..., x_T}. In the most general framework, a deep RNN with N hidden layers evaluates the sequence of hidden vectors S_h^{(n)} = {h_1^{(n)}, ..., h_T^{(n)}} for n = 1 to N, and the sequence of output vectors S_y = {y_1, ..., y_T}, by the following iterative computation:

h_t^{(0)} = x_t    (1)
h_t^{(n)} = f_{ac}^{(n)}\big(W^{(n-1,n)} h_t^{(n-1)} + W^{(n,n)} h_{t-1}^{(n)} + b^{(n)}\big)    (2)
y_t = f_{ac}^{(N+1)}\big(W^{(N,N+1)} h_t^{(N)} + b^{(N+1)}\big)    (3)

for n = 1, ..., N and t = 1, ..., T, where T is the number of frames. The input layer is associated with n = 0 and the output layer with n = N + 1. h_t^{(n)} denotes the hidden vector at the output of hidden layer n at time frame t; it is set to zero at t = 0. W^{(n-1,n)} is the weight matrix characterizing the feed-forward connections from layer n-1 to layer n, while W^{(n,n)} characterizes the recurrent connections of hidden layer n. b^{(n)} denotes the bias vector and f_{ac}^{(n)} the element-wise activation function of layer n, often chosen to be the logistic sigmoid or hyperbolic tangent function. h_t^{(n)} depends not only on the output of the layer below at time frame t, but also on the output of the current layer n at time frame t-1, so there are two directions of propagation, as represented in Figure 1: in the depth of the layers, like a standard MLP, and in time.

[Fig. 1: RNN unfolded in time]

RNNs are inherently deep in time, since their hidden vectors are a function of all the previous ones. They are able to model the dynamics of the input stream; they are thus classifiers that can handle the sequential aspect of input features extracted from the short-term analysis of a musical audio signal. In a classification task, an RNN considers a past temporal context to classify each input vector; the length of this context is automatically learned through the weights associated with the recurrent connections. However, a strong limitation of such a sequence classifier is that, with a gradient-based training algorithm, the temporal context learned is in practice limited to only a few instants, because of the vanishing gradient problem [11]: the temporal evolution of the back-propagated error depends exponentially on the magnitude of the weights. Thus, the error tends to either blow up or vanish as it is back-propagated in time, leading to oscillating weights or to weights that stay nearly constant. In both cases the training procedure is ineffective and the network fails to learn long-term dependencies.

2.2. Long Short-Term Memory

To overcome this issue, we can use LSTM blocks instead of simple neurons in each hidden layer. As represented in Figure 2, each LSTM block involves a memory cell. While the network is performing the classification, the cell content is controlled at each time step by the input and forget gates; the cell can store the input of the block it belongs to for as long as necessary. The block output is controlled by the output gate. During the training phase, error signals can be trapped within a memory cell; the multiplicative gates have to learn which error to trap and when to release it. LSTM blocks are thus designed to solve the vanishing gradient problem [12].

[Fig. 2: LSTM block]

The previous iterative procedure to compute the output vector of each hidden layer (equation (2)) is modified as follows [13, 14]:

i_t^{(n)} = \sigma\big(W_{(h,i)}^{(n-1,n)} h_t^{(n-1)} + W_{(h,i)}^{(n,n)} h_{t-1}^{(n)} + W_{(c,i)}^{(n,n)} c_{t-1}^{(n)} + b_i^{(n)}\big)    (4)
f_t^{(n)} = \sigma\big(W_{(h,f)}^{(n-1,n)} h_t^{(n-1)} + W_{(h,f)}^{(n,n)} h_{t-1}^{(n)} + W_{(c,f)}^{(n,n)} c_{t-1}^{(n)} + b_f^{(n)}\big)    (5)
c_t^{(n)} = f_t^{(n)} \odot c_{t-1}^{(n)} + i_t^{(n)} \odot \tanh\big(W_{(h,c)}^{(n-1,n)} h_t^{(n-1)} + W_{(h,c)}^{(n,n)} h_{t-1}^{(n)} + b_c^{(n)}\big)    (6)
o_t^{(n)} = \sigma\big(W_{(h,o)}^{(n-1,n)} h_t^{(n-1)} + W_{(h,o)}^{(n,n)} h_{t-1}^{(n)} + W_{(c,o)}^{(n,n)} c_t^{(n)} + b_o^{(n)}\big)    (7)
h_t^{(n)} = o_t^{(n)} \odot \tanh\big(c_t^{(n)}\big)    (8)

where \odot denotes the element-wise product, and \sigma(\cdot) and \tanh(\cdot) are respectively the element-wise logistic sigmoid and hyperbolic tangent functions. i_t^{(n)}, f_t^{(n)}, o_t^{(n)} and c_t^{(n)} are respectively the input gate, forget gate, output gate and memory cell activation vectors at hidden layer n and time frame t. These vectors have the same size as the hidden vector h_t^{(n)}, that is the number of LSTM blocks in hidden layer n. Hidden vectors and memory cell vectors are set to zero at t = 0.
Note that equations (4) to (7) involve different weight matrices W_{(\cdot,\cdot)}^{(\cdot,n)} and bias vectors b_{\cdot}^{(n)}. Moreover, the weight matrices from the memory cells to the multiplicative gates, W_{(c,\cdot)}^{(n,n)}, are diagonal, so that a multiplicative gate only considers the memory cell of the LSTM block it belongs to.

2.3. Bidirectional Recurrent Neural Networks

RNNs are only able to make use of a past temporal context. When the whole sequence of input features is available, it can be useful to exploit the future context as well. This can be done using a bidirectional RNN (BRNN). Each hidden layer of a BRNN contains two independent layers: the forward layer, which applies equation (2) from t = 1 to t = T, and the backward layer, which proceeds in the reverse order, replacing t-1 by t+1 and iterating over t = T, ..., 1. For each time step t, the activations of the n-th forward and backward hidden layers, \overrightarrow{h}_t^{(n)} and \overleftarrow{h}_t^{(n)}, are concatenated into a single vector (equation (9)) and supplied as input to the next layer:

h_t^{(n)} = \big[\, \overrightarrow{h}_t^{(n)} \, ; \, \overleftarrow{h}_t^{(n)} \,\big]    (9)
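To make the gate equations (4)-(8) and the bidirectional concatenation (9) concrete, the sketch below implements a single BLSTM hidden layer with NumPy. It is a minimal illustration under our own naming conventions, not the CURRENNT implementation used in the paper; the cell-to-gate (peephole) weights are kept diagonal as stated above, and the Gaussian initialization mirrors the one described in Section 3.2.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_layer(X, P, reverse=False):
    """One LSTM hidden layer, equations (4)-(8).
    X: (T, d_in) input sequence; P: parameter dict; returns (T, d_hid)."""
    T, d_hid = X.shape[0], P["b_c"].shape[0]
    h = np.zeros(d_hid)                      # h_0 = 0
    c = np.zeros(d_hid)                      # c_0 = 0
    out = np.zeros((T, d_hid))
    steps = range(T - 1, -1, -1) if reverse else range(T)
    for t in steps:
        x = X[t]
        i = sigmoid(P["W_xi"] @ x + P["W_hi"] @ h + P["w_ci"] * c + P["b_i"])   # (4)
        f = sigmoid(P["W_xf"] @ x + P["W_hf"] @ h + P["w_cf"] * c + P["b_f"])   # (5)
        c = f * c + i * np.tanh(P["W_xc"] @ x + P["W_hc"] @ h + P["b_c"])       # (6)
        o = sigmoid(P["W_xo"] @ x + P["W_ho"] @ h + P["w_co"] * c + P["b_o"])   # (7)
        h = o * np.tanh(c)                                                      # (8)
        out[t] = h
    return out

def blstm_layer(X, P_fwd, P_bwd):
    """Bidirectional layer: concatenate forward and backward hidden vectors, eq. (9)."""
    return np.concatenate([lstm_layer(X, P_fwd),
                           lstm_layer(X, P_bwd, reverse=True)], axis=1)

def init_params(d_in, d_hid, rng=np.random.default_rng(0)):
    """Gaussian initialization (mean 0, std 0.1); peephole weights are diagonal."""
    g = lambda *s: 0.1 * rng.standard_normal(s)
    P = {f"W_x{k}": g(d_hid, d_in) for k in "ifco"}
    P.update({f"W_h{k}": g(d_hid, d_hid) for k in "ifco"})
    P.update({f"w_c{k}": g(d_hid) for k in "ifo"})
    P.update({f"b_{k}": np.zeros(d_hid) for k in "ifco"})
    return P
```

For instance, `blstm_layer(np.random.randn(100, 80), init_params(80, 15), init_params(80, 15))` returns a (100, 30) sequence, which matches a first hidden layer of size 30 with the LSTM blocks split evenly between the forward and backward directions, as in the architecture retained later in the paper.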

LSTM-RNNs have proven their superiority over standard RNNs for learning long-term dependencies [12], and with precise timing [15]. To make use of a long-range past and future temporal context when classifying each input vector, the ideas of deep BRNNs and LSTM can thus be combined to form deep BLSTM-RNNs. This is the architecture we adopted in this study.

3. SYSTEM OVERVIEW

[Fig. 3: System overview]

As represented in Figure 3, the proposed system first applies a double-stage HPSS as pre-processing. Features are then extracted with a filter bank on a Mel scale and supplied as input to the deep BLSTM-RNN. The blocks of our system are described in more detail below.

3.1. Feature Extraction

Instead of presenting high-level features, whose design is essentially handcrafted and possibly sub-optimal, at the input of the classifier, we chose to use low-level features extracted from a filter bank distributed on a Mel scale. We were hoping that, through its hidden layers, a deep architecture would be able to extract higher-level representations of the input data, fitted to our task.

To compute the features, we work on mono signals resampled at 16 kHz and normalized to lie between -1 and 1. We first apply a double-stage HPSS as proposed in [16]. The original idea of HPSS [8] is to decompose the spectrogram of the input signal into one spectrogram that is smooth in the time direction, associated with harmonic components, and another that is smooth in the frequency direction, associated with percussive components. Singing voice is a fluctuating sound, not as stationary as harmonic instruments like piano or guitar, but obviously much more so than percussive ones; it therefore lies between the harmonic and percussive components in HPSS. By controlling the time/frequency resolution through the analysis window, we can thus make the partials of singing voice appear smooth in either the time or the frequency direction. With a first HPSS using a long (256 ms) analysis window, singing voice is associated with the percussive components in a signal p_1(t) and separated from temporally stable, harmonic sounds contained in a signal h_1(t). Applying a second HPSS to p_1(t), with a short analysis window (32 ms), singing voice is then associated with the harmonic components in a signal h_2(t) and isolated from percussive sounds, which end up in a signal p_2(t). Finally, h_2(t) is a rough estimation of the singing voice signal.

For each of the three signals h_1(t), p_2(t) and h_2(t), we computed the Short-Time Fourier Transform (STFT) with a 32 ms Hann window and 50% overlap. 40 coefficients are then extracted from 40 triangular filters linearly spaced on a Mel scale with 50% overlap. A frequency of f Hertz is mapped to Mel by f_Mel = 2595 log(1 + f/700) [17]. We tried different combinations of features from the three signals and obtained the best results by keeping the features from the signals associated with singing voice and percussive components. Our feature vector is thus 80 coefficients long, corresponding to the concatenation of the outputs of the filter bank applied to h_2(t) and p_2(t). We take the logarithm of this vector in order to reduce the dynamic range of the data. Finally, each dimension of the input vector is normalized so as to have a mean close to zero and a standard deviation close to 1 over the training database. This conditioning, along with the weight initialization, is important to prevent neuron saturation and to make learning fast [18].
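As a concrete illustration of this feature pipeline (32 ms Hann-window STFT with 50% overlap, 40 triangular Mel filters, logarithm, then concatenation over h_2(t) and p_2(t)), here is a minimal NumPy sketch. It assumes the double-stage HPSS has already produced `h2` and `p2`; the function names and the log10-based form of the Mel mapping are our choices rather than the authors' code, and the per-dimension mean/variance normalization over the training set is left to the caller.

```python
import numpy as np

def stft_mag(x, sr=16000, win_ms=32, overlap=0.5):
    """Magnitude STFT with a Hann window (32 ms, 50% overlap)."""
    n = int(sr * win_ms / 1000)                                  # 512 samples at 16 kHz
    hop = int(n * (1 - overlap))
    w = np.hanning(n)
    frames = [x[i:i + n] * w for i in range(0, len(x) - n + 1, hop)]
    return np.abs(np.fft.rfft(np.asarray(frames), axis=1))       # shape (T, n // 2 + 1)

def mel_filterbank(n_filters=40, n_bins=257, sr=16000):
    """Triangular filters on a Mel scale, with f_mel = 2595 * log10(1 + f / 700)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(0.0, hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor(mel_to_hz(mel_pts) / (sr / 2) * (n_bins - 1)).astype(int)
    fb = np.zeros((n_filters, n_bins))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        if center > left:
            fb[i - 1, left:center] = np.linspace(0.0, 1.0, center - left, endpoint=False)
        if right > center:
            fb[i - 1, center:right] = np.linspace(1.0, 0.0, right - center, endpoint=False)
    return fb

def log_mel_features(h2, p2, sr=16000, eps=1e-10):
    """80-dimensional feature per frame: 40 log-Mel bands from h2(t), 40 from p2(t)."""
    fb = mel_filterbank()
    per_signal = [np.log(stft_mag(s, sr) @ fb.T + eps) for s in (h2, p2)]
    return np.concatenate(per_signal, axis=1)                    # shape (T, 80)
```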
3.2. Building the Network by Incremental Training

b_n:                  10     20     30     40     50
80-b_1-1             15.5   14.6   13.5   14.1   14.2
80-30-b_2-1          11.4   10.7   12.2   12.4   14.0
80-30-20-b_3-1        9.4    9.3   10.5    8.5    9.6
80-30-20-40-b_4-1     9.3   10.0    9.4   12.0    9.4

Table 1. Classification error (%) on the Jamendo test dataset (cf. Section 4.1) according to the network architecture. The left column gives the layer sizes from input to output; b_n is the number of LSTM blocks in hidden layer n.

A difficulty with neural networks is that there is no theoretical evidence to define the architecture a priori for a given task, that is the number of hidden layers and the number of neurons or LSTM blocks within each layer. The input layer size is fixed by the dimension of the input vector, 80 in our experiments. As we are working on a binary classification task, there is a single neuron in the output layer with a logistic sigmoid activation function; its output is an estimate of the probability of singing voice presence. For a deep RNN with several hidden layers, an incremental training procedure that progressively adds the hidden layers has been proposed in [19]. It allows each layer to have some time during training in which it is directly connected to the output layer. When a hidden layer is added, the weights previously learned for the layers below are kept and then the whole network is trained. For the current hidden layer and the output one, we initialize the weights according to a Gaussian distribution with mean 0 and standard deviation 0.1. We found this training procedure to be more effective than training the whole network from scratch. An explanation is suggested in [20], where the authors experimentally show that in a supervised gradient-trained deep neural network with random weight initialization, the layers far from the outputs are poorly optimized; those authors propose an independent pre-training of each hidden layer, and the incremental procedure we use here is another solution. In our work, we extended this procedure in order to learn the network architecture automatically during training: we add hidden layers progressively and for each one we choose the size b_n that minimizes the classification error on the test dataset, as reported in Table 1. The procedure is stopped when adding a new hidden layer does not improve the classification results. When training a neural network, we are not so much interested in the optimization problem as in the generalization one; by considering the classification error on the test dataset, we are looking for the model that best generalizes to unseen data. As can be seen from the results in Table 1, the best architecture we found is a BLSTM-RNN with three hidden layers of sizes 30, 20 and 40. Within each layer, the LSTM blocks are evenly split between the forward and backward layers.
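The greedy layer-adding search just described can be organized as in the sketch below. This is a schematic rendering of the procedure rather than the authors' code: `train_and_evaluate` is a hypothetical helper that builds a BLSTM-RNN with the given hidden-layer sizes (80 inputs, one sigmoid output), reuses the supplied weights for the layers kept from the previous round, trains the whole network, and returns the test-set classification error together with the learned weights.

```python
CANDIDATE_SIZES = [10, 20, 30, 40, 50]   # values of b_n explored in Table 1

def grow_network(train_and_evaluate):
    """Greedy incremental architecture search: add one hidden layer at a time,
    keep the size that minimizes the test classification error, and stop when
    a new layer no longer helps. Returns the chosen sizes, weights and error."""
    layers, weights, best_err = [], None, float("inf")
    while True:
        best_round = None
        for b in CANDIDATE_SIZES:
            err, w = train_and_evaluate(layers + [b], weights)
            if best_round is None or err < best_round[0]:
                best_round = (err, b, w)
        err, b, w = best_round
        if err >= best_err:              # adding another layer does not improve: stop
            return layers, weights, best_err
        layers, weights, best_err = layers + [b], w, err
```

With the numbers reported in Table 1, such a loop would settle on hidden layers of sizes 30, 20 and 40, since the best fourth-layer error (9.3%) does not improve on the 8.5% obtained with three layers.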

3.3. Training Algorithm

As the output of the network is an estimate of the probability of singing voice presence, we used the cross-entropy error as loss function. Each training phase is done by Back-Propagation Through Time (BPTT) in the context of LSTM networks [21, 14]. We used the open-source CURRENNT Toolkit (http://sourceforge.net/p/currennt), which implements BPTT on a Graphics Processing Unit (GPU). Weights are updated after each sequence; within each epoch, the sequences are selected randomly. Over-fitting is controlled by early stopping: training starts with a gradient descent step η = 10^-5 and a momentum m = 0.9. If the cross-entropy error does not improve on the validation set within 20 epochs, we set η = 10^-6 and training continues from the weights associated with the last improvement. If there is no improvement after 10 further epochs, we set η = 10^-7 and training continues as before; finally, if there is no improvement with this last step during 10 epochs, training is stopped. The momentum is chosen close to one in order to keep enough inertia to avoid local minima and to attenuate the oscillatory trajectory of the stochastic gradient descent.

4. RESULTS

4.1. Jamendo: A Common Benchmark Dataset

For our experiments, we used the Jamendo Corpus, a publicly available dataset including singing voice activity annotations. It contains 93 copyright-free songs retrieved from the Jamendo website (http://www.jamendo.com). The database was built and published along with [5]. The corpus is divided into three sets: the training set contains 61 files, while the validation and test sets contain 16 songs each. This common database provides a fair comparison of our approach with others from the literature.

4.2. Network Functioning

To highlight the internal functioning of the network, we represent in Figure 4, for about 7 s of a track from the Jamendo test dataset, the sequence of input vectors (h_2(t) in the lower half and p_2(t) in the upper half, cf. Section 3.1), the output of each hidden layer, the output of the network (an estimate of the probability of singing voice presence), the decision taken by the network by thresholding this probability at 0.5, and the ground truth. Through the depth of the network, the outputs of the layers become more and more stable and a clear temporal structure emerges, with the appearance of segments associated with singing voice presence or absence. From a low-level representation extracted by a filter bank on a Mel scale, which is highly variable in time, the network is able to extract a simple representation at the output of the third hidden layer, highlighting singing voice presence. The track we used here contains a long section of total silence. We can see that the outputs of the hidden layers continue to vary during this section while the inputs remain constant; this observation shows that the network has learned a temporal context.

[Fig. 4: Network functioning on a 7 s excerpt of "03 - Say me goodbye" from the Jamendo test database. Color scale between -1 (white) and +1 (black); the output belongs to [0,1]; the decision and the ground truth belong to {0,1}, where 0 (grey segments) denotes voice absence and 1 (black segments) denotes voice presence.]

4.3. Results

To evaluate the performance of our system, we compute four common evaluation measures [22] over all the frames of the test set. The classification Accuracy is the proportion of frames correctly classified. The Recall is the proportion of frames labeled as voiced in the ground truth that are estimated as voiced by the algorithm. The Precision is the proportion of frames estimated as voiced by the algorithm that are effectively voiced in the ground truth. Finally, the F-measure (also called F1 score) is a global performance measure corresponding to the harmonic mean of precision and recall.
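For reference, these frame-level measures can be computed as in the short sketch below. This is our own illustration of the standard definitions, including the 0.5 thresholding of the network output mentioned in Section 4.2; it is not code from the paper.

```python
import numpy as np

def frame_metrics(prob, truth, threshold=0.5):
    """Accuracy, recall, precision and F-measure over frame-wise decisions.
    prob: network outputs in [0, 1]; truth: ground-truth labels in {0, 1}."""
    pred = (np.asarray(prob) >= threshold).astype(int)
    truth = np.asarray(truth).astype(int)
    tp = np.sum((pred == 1) & (truth == 1))     # voiced frames correctly detected
    fp = np.sum((pred == 1) & (truth == 0))     # false alarms
    fn = np.sum((pred == 0) & (truth == 1))     # missed voiced frames
    accuracy = np.mean(pred == truth)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return accuracy, recall, precision, f_measure
```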
Table 2 compares the results of our method with those from [5], [6] and [7], the latter being, to the best of our knowledge, the one that previously provided the best results on this database. Over all the measures, our system performs better. This is remarkable considering that we used simple low-level features and no post-processing. In [7], Lehner et al. improved precision by means of specifically designed features, but to the detriment of recall compared to their previous method in [6]. Indeed, manually designing high-level features can be sub-optimal. Conversely, the deep BLSTM-RNN we used automatically learned how to extract useful information from low-level features and finally improved both recall and precision. This is noticeable and explains the particularly high F-measure we obtain.

                 RAMONA [5]   LEHNER(a) [6]   LEHNER(b) [7]   NEW
Accuracy (%)        82.2          84.8            88.2        91.5
Recall (%)          n/a           90.4            86.2        92.6
Precision (%)       n/a           79.5            88.0        89.5
F-measure           84.3          84.6            87.1        91.0

Table 2. Singing voice detection results on the Jamendo test database.

5. CONCLUSION

In this paper we presented a new approach for singing voice detection. Instead of working on defining a complex feature set, we took advantage of neural networks to extract, from low-level features, simple representations fitted to our task. Furthermore, the BLSTM-RNN we used is a classifier that inherently takes a temporal context into account, thus removing the need for post-processing to handle sequential aspects. This new method significantly improved state-of-the-art results on a common database. This performance encourages further work with BLSTM-RNNs in music information retrieval for sequence classification tasks, for instance in the context of automatic melody estimation.

6. REFERENCES

[1] Justin Salamon, Emilia Gomez, Dan Ellis, and Gaël Richard, "Melody extraction from polyphonic music signals: Approaches, applications and challenges," IEEE Signal Processing Magazine, vol. 31, no. 2, pp. 118-134, Mar. 2014.

[2] Chao-Ling Hsu, DeLiang Wang, Jyh-Shing Roger Jang, and Ke Hu, "A tandem algorithm for singing pitch extraction and voice separation from music accompaniment," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 5, pp. 1482-1491, Jul. 2012.

[3] Shankar Vembu and Stephan Baumann, "Separation of vocals from polyphonic audio recordings," in International Society for Music Information Retrieval (ISMIR) Conference, London, UK, Sep. 2005, pp. 337-344.

[4] Youngmoo E. Kim and Brian Whitman, "Singer identification in popular music recordings using voice coding features," in International Society for Music Information Retrieval (ISMIR) Conference, Paris, France, Oct. 2002, vol. 13, p. 17.

[5] Mathieu Ramona, Gaël Richard, and Bertrand David, "Vocal detection in music with support vector machines," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Las Vegas, Nevada, USA, Mar.-Apr. 2008, pp. 1885-1888.

[6] Bernhard Lehner, Reinhard Sonnleitner, and Gerhard Widmer, "Towards light-weight, real-time-capable singing voice detection," in International Society for Music Information Retrieval (ISMIR) Conference, Curitiba, Brazil, Nov. 2013, pp. 53-58.

[7] Bernhard Lehner, Gerhard Widmer, and Reinhard Sonnleitner, "On the reduction of false positives in singing voice detection," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, May 2014, pp. 7480-7484.

[8] Nobutaka Ono, Kenichi Miyamoto, Jonathan Le Roux, Hirokazu Kameoka, and Shigeki Sagayama, "Separation of a monaural audio signal into harmonic/percussive components by complementary diffusion on spectrogram," in European Signal Processing Conference (EUSIPCO), Lausanne, Switzerland, Aug. 2008.

[9] Martín Rocamora and Perfecto Herrera, "Comparing audio descriptors for singing voice detection in music audio files," in Brazilian Symposium on Computer Music (SBCM), San Pablo, Brazil, Sep. 2007, vol. 26, p. 27.

[10] Lise Regnier and Geoffroy Peeters, "Singing voice detection in music tracks using direct voice vibrato detection," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Taipei, Taiwan, Apr. 2009, pp. 1685-1688.

[11] Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, and Jürgen Schmidhuber, "Gradient flow in recurrent nets: the difficulty of learning long-term dependencies," in A Field Guide to Dynamical Recurrent Neural Networks, Kremer and Kolen, Eds., IEEE Press, 2001.

[12] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, Nov. 1997.

[13] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton, "Speech recognition with deep recurrent neural networks," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, BC, Canada, May 2013, pp. 6645-6649.

[14] Alex Graves, Supervised Sequence Labelling with Recurrent Neural Networks, Ph.D. thesis, Technische Universität München, Jul. 2008.

[15] Felix A. Gers, Nicol N. Schraudolph, and Jürgen Schmidhuber, "Learning precise timing with LSTM recurrent networks," The Journal of Machine Learning Research, vol. 3, pp. 115-143, Aug. 2002.
[16] Hideyuki Tachibana, Takuma Ono, Nobutaka Ono, and Shigeki Sagayama, "Melody line estimation in homophonic music audio signals based on temporal-variability of melodic source," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Dallas, Texas, USA, Mar. 2010, pp. 425-428.

[17] Douglas D. O'Shaughnessy, Speech Communications: Human and Machine (2nd ed.), IEEE Press, 2000, p. 128.

[18] Yann LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller, "Efficient BackProp," in Neural Networks: Tricks of the Trade, Genevieve B. Orr and Klaus-Robert Müller, Eds., Springer, 1998.

[19] Michiel Hermans and Benjamin Schrauwen, "Training and analysing deep recurrent neural networks," in Advances in Neural Information Processing Systems (NIPS), Harrahs and Harveys, Lake Tahoe, Nevada, United States, Dec. 2013, pp. 190-198.

[20] Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle, "Greedy layer-wise training of deep networks," in Advances in Neural Information Processing Systems (NIPS), Vancouver, BC, Canada, Dec. 2007, vol. 19, p. 153.

[21] Alex Graves and Jürgen Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Networks, Jul.-Aug. 2005.

[22] Marina Sokolova and Guy Lapalme, "A systematic analysis of performance measures for classification tasks," Information Processing & Management, vol. 45, no. 4, pp. 427-437, Jul. 2009.