Analyzing the influence of pitch quantization and note segmentation on singing voice alignment in the context of audio-based Query-by-Humming


Jose J. Valero-Mas
Pattern Recognition and Artificial Intelligence Group, University of Alicante
jjvalero@dlsi.ua.es

Justin Salamon
Music and Audio Research Laboratory, New York University
justin.salamon@nyu.edu

Emilia Gómez
Music Technology Group, Universitat Pompeu Fabra
emilia.gomez@upf.edu

ABSTRACT

Query-by-Humming (QBH) systems base their operation on aligning the melody sung or hummed by a user with a set of candidate melodies retrieved from polyphonic songs. While MIDI-based QBH builds on the premise that annotated transcriptions exist for every candidate song, audio-based research makes use of melody estimation algorithms for the songs. In both cases, a melody abstraction process is required to solve issues commonly found in queries, such as key transpositions or tempo deviations. Full automatic music transcription processes are commonly used for this but, given the reported limitations of state-of-the-art methods on real-world queries, other possibilities should be considered. In this work we explore three different melody representations, ranging from a general time-series one to more musical abstractions, which avoid full automatic transcription, in the context of an audio-based QBH system. Results show that this abstraction process plays a key role in the overall accuracy of the system, with the best scores obtained when temporal segmentation is performed dynamically in terms of pitch change events in the melodic contour.

1. INTRODUCTION

Query-by-Humming systems constitute a particular case of content-based music similarity search schemes in which the input query is a sung, hummed or whistled section of a song, usually its main melody [1, 2], and the output is the target song. Such a music retrieval paradigm stands as an interesting alternative to classic text-based retrieval frameworks (for instance, tag-based search) because of its simple usage and the fact that no musical knowledge is required from the user [3].
Copyright: © 2015 Jose J. Valero-Mas et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Research in QBH mainly focuses on addressing the inaccuracies found when producing the queries: on the one hand, tuning issues have to be considered, as users may sing out of tune and/or in a different key [4]; on the other hand, tempo deviations between queries and candidates may also occur [4, 5]. To overcome them, a melody abstraction process, which may range from general time-series codifications to more sophisticated music-based ones, is followed by a melody comparison stage that estimates the dissimilarity between the query and the candidates [6].

The process for obtaining the set of candidate melodies is not trivial [2, 5, 7]: main fundamental frequency (f0) estimation for queries and candidates cannot be assumed to be an accurate process, especially when dealing with polyphonic songs [8]. While this estimation process is inevitable for the queries, as they constitute the user audio input to the system, the issue has typically been avoided for the candidate songs by assuming the existence and availability of high-level annotated representations (for instance, MIDI files) of these melodies. Due to the limitations this assumption implies, mostly in terms of practical systems, some QBH schemes try to estimate the melody algorithmically from audio. Although more realistic, this adds complexity to the system since no melody estimation algorithm is error-free.

As mentioned above, melodic contours require an abstraction process. To take advantage of the large amount of research carried out in the symbolic melodic similarity field, melodies estimated from audio sources are coded into high-level music representations [9], usually with full automatic music transcription systems.
However, given the limitations that current state-of-the-art transcription algorithms exhibit [10], it is worth studying alternative abstractions to such high-level representations.

In this paper we present a study of the influence of different melody abstraction processes which avoid the complexity of full automatic music transcription in the context of QBH. In particular, we assess the influence of pitch quantization and note segmentation on singing voice alignment for QBH. To do so, we take the scheme in Figure 1 as a starting point and evaluate three different melodic contour representations: the first one makes use of the time-series encoding algorithm Symbolic Aggregate Approximation (SAX) [11], which is based on a fixed-duration temporal segmentation and statistical encoding; the second one modifies the original SAX algorithm so that the encoding is performed using a semitone-band representation; finally, as a third method, we propose to segment the melody using the pitch change events in the melodic contour.

To ensure the scalability of the system we use the melody estimation algorithm MELODIA [12]. This method estimates the predominant pitch from both monophonic and polyphonic music signals. For the contour comparison, we apply two sequence alignment algorithms: Smith-Waterman [13], originally meant for DNA sequences but widely applied in the time series field, and Subsequence Dynamic Time Warping [14].

The rest of the paper is structured as follows: Section 2 briefly reviews similar research proposals; Section 3 and Section 4 present the melody extraction algorithm MELODIA and the local alignment algorithms considered, respectively; Section 5 introduces the assessed contour representations; Section 6 presents the evaluation methodology; Section 7 presents and discusses the results obtained; finally, Section 8 outlines the conclusions and proposes possible future work.

2. RELATED WORK

One of the first proposed QBH systems was the one by Ghias et al. [15], in which queries were transcribed using autocorrelation for pitch tracking, the candidate elements were MIDI files and the search was performed using a fuzzy string matching algorithm. Although many similar systems based on some kind of full automatic music transcription have been proposed since then, the work by Dannenberg et al. [3] with the MUSART Testbed, a framework for the assessment of this type of QBH system, stands as a relevant example.

Among systems not based on full automatic music transcription, a relevant example is the one by Duda et al. [1], in which a series of audio descriptors (Mel-Frequency Cepstrum Coefficients, Power, Fundamental frequency contour, Voice Formants and Chroma) are extracted from the audio files and then encoded using SAX [11]; similarity is computed using Edit distance [17]. Another example can be found in the system by Ito et al. [5]. In this case, instead of obtaining a single melodic contour for the candidate elements, multiple fundamental frequency candidates are retrieved, using a variation of the PreFEst algorithm [18], for comparison with the query contour using a basic scoring function. Salamon et al. [2] proposed a system in which melodies are quantized into semitones and mapped into one octave.
Similarity is computed using the Qmax algorithm [19]. In terms of the automatic extraction of melodies, the techniques explored include fundamental frequency extraction algorithms [5, 16], main singing voice extraction [1, 7] and predominant melody estimation algorithms [2]. All approaches are summarized in Table 1.

3. MELODY ESTIMATION

Melodies from both queries and candidate songs are obtained using the predominant melody estimation algorithm MELODIA [12] (footnote 1). For a given music piece, the algorithm estimates the fundamental frequency of the predominant melodic line in the song. This particular algorithm outperformed all other state-of-the-art methods in the Audio Melody Extraction task of the 2011 Music Information Retrieval Evaluation eXchange (MIREX) campaign (footnote 2). In a more detailed analysis, results in [12] report its robustness in terms of octave errors (properly tracking pitch values in the correct octave) and voiced frame detection (frames belonging to the predominant melody). However, it must also be pointed out that the algorithm tends to confuse unvoiced elements as voiced, thus lowering the overall performance.

Finally, we briefly describe the four stages MELODIA comprises: an initial Sinusoid extraction step estimates the predominant frequency values at each instant in the signal; then, a Salience function based on a harmonic series is derived; after that, a series of Pitch contours are created using a set of rules based on Auditory Scene Analysis (ASA); finally, the predominant melody is selected in the Melody selection stage. In this experimentation, MELODIA has been configured to its default analysis rate ($\Delta t_{MEL} = 2.9$ ms).

Figure 2. Block diagram of the MELODIA algorithm: audio signal, Sinusoid extraction (spectral peaks), Salience function (salience representation), Contour creation (pitch contours), Melody selection (melody).

Footnote 1: http://mtg.upf.edu/technologies/melodia

4. MELODY ALIGNMENT

In this work, similarity between the query and the candidate melodies is estimated by means of sequence alignment methods.
This premise suits the QBH task, as queries may contain tempo deviations with respect to the corresponding melodies of the actual song to be retrieved. The two algorithms considered are now introduced.

4.1 Smith-Waterman

The Smith-Waterman method [13] is an alignment algorithm originally proposed for DNA sequences. It searches for the most similar regions between a pair of sequences, coded as strings, in a time-warped scenario. Smith-Waterman requires a series of costs to be defined: a reward for symbol matches ($C_{MATCH}$), a penalty for mismatches ($C_{MISMATCH}$) and two costs for time warps ($C_{INSERTION}$ and $C_{DELETION}$). Table 2 shows the different configurations considered.

4.2 Subsequence Dynamic Time Warping

Subsequence Dynamic Time Warping (S-DTW) constitutes a modification of Dynamic Time Warping (DTW) proposed by Müller in [14]. While DTW forces a global alignment between two sequences, S-DTW removes that restriction to allow local matches between the sequences. This modification makes it suitable for query-by-example applications [20], as queries usually constitute an excerpt of the element to be retrieved. The cost function used in this paper has been the Edit distance (ED) [17].

Footnote 2: http://www.music-ir.org/mirex/wiki/MIREX_HOME

Figure 1. Scheme of the QBH system proposed. Main melodies are estimated from the audio files (query and candidate songs) using the melody estimation algorithm MELODIA and are then encoded using a certain contour representation; local alignment between the query and each element in the database is then performed and the results are finally ranked. Query path: user audio query, MELODIA (F0 contour), melody abstraction (coded contour), sequence alignment, ranked output. Collection path: music collection audio files, MELODIA (F0 contour), melody abstraction (coded contour), contour database, database retrieval and formatting.

Table 1. Summary of related QBH approaches.

| First author | Feature(s) | Feature extraction (query) | Feature extraction (music collection) | Abstraction | Similarity |
|---|---|---|---|---|---|
| Ghias [15] | Main F0 contour | Pitch tracking (autocorrelation) | MIDI files | Strings representing changes in contour: U (up), D (down) and S (same) | Fuzzy string matching |
| Dannenberg [3] | Main F0 contour | Pitch tracking (autocorrelation) | MIDI files | Note Interval, IOI + Relative pitch, Fixed-Time Segmentation + Relative pitch | N-gram, Contour Matching, HMM Matching, CubyHum Matcher |
| Duda [1] | MFCC, Audio Power, F0, Voice Formants, Chroma + derivatives (1st and 2nd order) | No extraction | Stereo pan removal to retrieve lead singing voice | SAX coefficients | Edit distance |
| Jeon [16] | Main F0 contour | Constant-Q Transform + heuristics | Constant-Q Transform + heuristics | Wavelet coefficients | Coefficients comparison |
| Ito [5] | Multiple F0 contours | PreFEst variation | PreFEst variation | Tempo normalization + logarithm of frequency values | Scoring function (absorbs key differences) |
| Salamon [2] | Main F0 contour | MELODIA | MELODIA | Semitone-band based chromagrams with fixed-time segmentation | Qmax |
| Rocamora [7] | Lead singing voice | YIN + energy-based segmentation | Singing voice detection and extraction (+ query process) | Pitch and duration ratios (relative encoding) | Edit distance |

Table 2. Weights of the four tested configurations for the Smith-Waterman alignment algorithm.

| Configuration | C_MATCH | C_MISMATCH | C_INSERTION | C_DELETION |
|---|---|---|---|---|
| T1 | 1 | -0.5 | -0.5 | -0.5 |
| T2 | 1 | -1 | -0.5 | -0.5 |
| T3 | 1 | -1 | -1 | -1 |
| T4 | 1 | -0.5 | -1 | -1 |

5. MELODY ABSTRACTIONS

We now describe the three melody abstractions considered for encoding the estimated melodic pitch contours.

Figure 3. Diagram depicting the different stages the three proposed abstractions comprise, grouped into preprocessing, temporal segmentation and contour quantization. SAX: normalization, PAA, statistical encoding. PAA ST: PAA, semitone encoding. PC ST: signal smoothing, pitch change, glitch removal, semitone encoding.

5.1 Symbolic Aggregate Approximation (SAX)

SAX, introduced by Lin et al. [11] in 2007, is a symbolic representation for time series (sequences encoded as strings) able to cope with two major drawbacks usually found in other methods: the need for both a dimensionality reduction and a lower bound in the distance computations. Although reported as a fast and competitive algorithm for similarity search, SAX has not been widely used in Music Information Retrieval (MIR). Some of the few examples in this field can be found in the study of guitar articulations [21], Beijing opera singing similarity [22] or in QBH [1]. SAX comprises three steps for coding any sequence:

5.1.1 Time-series normalization

Given a time series $C = \{c_1, c_2, \ldots, c_n\}$ of length $n$, this abstraction performs an initial normalization process:

$$c'_i = \frac{c_i - \mu}{\sigma}, \quad 1 \le i \le n \qquad (1)$$

where $c_i$ represents each element of the initial time series (the f0 contour in cents (footnote 3) retrieved by MELODIA) and $\mu$ and $\sigma$ are its mean and standard deviation, respectively.

5.1.2 Piecewise Aggregate Approximation (PAA)

This second stage takes the normalized time series $C'$ of length $n$ and maps it into an $M$-dimensional vector $\bar{C} = \{\bar{c}_1, \bar{c}_2, \ldots, \bar{c}_M\}$ of equally-sized segments, where $M$ is a modifiable parameter:

$$\bar{c}_i = \frac{M}{n} \sum_{j=\frac{n}{M}(i-1)+1}^{\frac{n}{M} i} c'_j, \quad 1 \le i \le M \qquad (2)$$

Given the different lengths of the f0 sequences to encode, fixing a global $M$ value would make each segment represent a different temporal duration in each sequence. Instead, we fix a frame temporal duration $\tau_t$ for all sequences. Since each $c_i$ represents $\Delta t_{MEL}$, the frame size in samples can be obtained as $\tau_s = \tau_t / \Delta t_{MEL}$. Thus, $M$ is given by $M = n / \tau_s$. As an initial experiment, the $\tau_t$ values considered are 0.3, 0.5, 0.8, 1 and 2 seconds.

5.1.3 Symbolic representation

The last stage maps $\bar{C}$ to a series of $a$ discrete symbols, where $a$ is an adjustable parameter. To ensure equiprobability of appearance for all symbols, $a$ regions are defined based on a statistical distribution, typically Gaussian [11]. The breakpoints $B = (\beta_1, \beta_2, \ldots, \beta_{a-1})$ delimiting such regions are chosen so that the area under a $\mathcal{N}(0, 1)$ Gaussian curve from $\beta_j$ to $\beta_{j+1}$ equals $1/a$. In addition, $\beta_0 = -\infty$ and $\beta_a = +\infty$.

Footnote 3: The reference frequency is 55 Hz as it represents the minimum frequency value retrieved by MELODIA.
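The cents conversion, the normalization of Equation (1) and the fixed-duration PAA described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the contour is assumed to contain voiced frames only, the population standard deviation is assumed for $\sigma$, $\tau_s$ is rounded to the nearest integer, and a shorter trailing frame is simply averaged as it is.

```python
import math
from statistics import mean, pstdev

F_REF_HZ = 55.0    # reference frequency stated in the text (footnote 3)
DT_MEL_S = 0.0029  # MELODIA analysis rate reported in Section 3 (2.9 ms)


def hz_to_cents(f0_hz, f_ref=F_REF_HZ):
    """Convert a frequency in Hz to cents above the reference frequency."""
    return 1200.0 * math.log2(f0_hz / f_ref)


def z_normalize(series):
    """Equation (1): subtract the mean and divide by the standard deviation."""
    mu = mean(series)
    sigma = pstdev(series)  # population standard deviation (assumption)
    if sigma == 0:
        return [0.0 for _ in series]
    return [(c - mu) / sigma for c in series]


def paa_fixed_duration(series, tau_t, dt=DT_MEL_S):
    """PAA with a fixed frame duration tau_t (seconds) instead of a fixed M.

    Each frame spans tau_s = round(tau_t / dt) samples; the last frame may be
    shorter and is averaged over the samples it has (assumption).
    """
    tau_s = max(1, round(tau_t / dt))
    return [mean(series[i:i + tau_s]) for i in range(0, len(series), tau_s)]
```

With $\tau_t = 0.3$ s and $\Delta t_{MEL} = 2.9$ ms, this rounding yields frames of 103 pitch samples.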
Each interval $[\beta_{j-1}, \beta_j)$ represents a certain symbol $\alpha_j$. Therefore, the $M$-length vector $\bar{C} = \{\bar{c}_1, \bar{c}_2, \ldots, \bar{c}_M\}$ is mapped into the $M$-length vector $\hat{C} = (\hat{c}_1, \hat{c}_2, \ldots, \hat{c}_M)$:

$$\hat{c}_i = \alpha_j \quad \text{if } \bar{c}_i \in [\beta_{j-1}, \beta_j), \quad 1 \le i \le M, \; 1 \le j \le a \qquad (3)$$

As an exploratory study, the tested values of $a$ have been 3, 4, 6, 8, 12, 16 and 20.

Figure 4. Example of the SAX abstraction process with $a = 5$ and $\tau_s = 0.3$ s: (a) initial time series in cents; (b) normalized time series (solid) and PAA codification (dashed); (c) PAA codification (solid) and SAX encoding breakpoints (dotted).

5.2 PAA temporal segmentation with semitone quantization (PAA ST)

The first proposed SAX modification revises the Symbolic representation stage: instead of using a statistical distribution approach for the vertical quantization, a fixed grid with semitone divisions is established. The minimum frequency value considered is 55 Hz, given that it is the minimum f0 retrieved by MELODIA. The normalization stage is omitted as it modifies the pitch range. Folding the contour to a single octave as in [2] was discarded, as preliminary nonexhaustive experimentation did not report improvements. Finally, relative pitch encoding is applied (storing intervals between segments) to provide transposition invariance. In this abstraction, the assessed time durations for the PAA segments have been the same as in the SAX abstraction.
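The two vertical quantization options, the Gaussian breakpoints of SAX and the semitone grid of PAA ST, can be contrasted with a short sketch. It is illustrative only: the function names are hypothetical, symbols are returned as 0-based integer indices rather than letters, and rounding to the nearest semitone on a 100-cent grid anchored at the 55 Hz reference is an assumption about how the grid is applied.

```python
from bisect import bisect_right
from statistics import NormalDist


def gaussian_breakpoints(a):
    """Breakpoints beta_1..beta_{a-1} splitting N(0, 1) into a equiprobable regions."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf(j / a) for j in range(1, a)]


def sax_symbols(paa_values, a):
    """Equation (3): map each PAA value to the index of its interval.

    bisect_right counts the breakpoints that are <= the value, which is the
    0-based index of the half-open interval [beta_{j-1}, beta_j) containing it.
    """
    breakpoints = gaussian_breakpoints(a)
    return [bisect_right(breakpoints, v) for v in paa_values]


def semitone_intervals(paa_cents):
    """PAA ST: quantize PAA values (in cents) to semitones, then store the
    differences between consecutive segments (relative pitch encoding)."""
    semitones = [round(c / 100.0) for c in paa_cents]
    return [cur - prev for prev, cur in zip(semitones, semitones[1:])]
```

Because only the differences are stored, shifting every input value by a whole number of semitones leaves the output of `semitone_intervals` unchanged, which is the transposition invariance the text refers to.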

Figure 5. Example of the PAA ST abstraction process with $\tau_s = 0.3$ s: (a) initial time series in cents (solid); (b) initial time series in cents (solid) and PAA codification (dashed); (c) PAA codification (solid) and semitone grid breakpoints (dotted).

5.3 Pitch change segmentation with semitone quantization (PC ST)

This second modification builds on the previous one but avoids PAA and dynamically segments the melodic contour whenever there is a pitch change event. Vertical quantization using a semitone grid is maintained. In order to avoid false segments due to artifacts and fast pitch changes that the pitch contour may contain, a softening process is applied. The softening process comprises two steps: (a) an initial signal smoothing using an average filter of duration $\tau_{SM}$ with a sliding window (applied before the semitone quantization process) and (b) a glitch removal step that applies a median filter of duration $\tau_{GR}$ with a sliding window to remove segments shorter than a certain duration (applied after the semitone quantization step). We have studied four different filter durations: 25, 50, 75 and 100 pitch samples. Given the MELODIA analysis rate, these values correspond to filter durations $\tau_{SM}$ and $\tau_{GR}$ of 70, 140, 218 and 290 milliseconds, respectively.

Figure 6. Example of the PC ST abstraction process with $\tau_{SM} = 70$ ms and $\tau_{GR} = 140$ ms: (a) initial time series in cents (solid), smoothed contour after the first filter (dashed) and semitone grid (dotted); (b) absolute semitone encoding; (c) absolute semitone encoding after the second filter, where the cross symbol marks each new temporal segment.

6. EVALUATION METHODOLOGY

6.1 Dataset

The evaluation data is the same as in [2] and comprises a query corpus and a music collection. The music collection, or candidate songs, contains 2125 commercial songs [19] distributed in 523 groups (each one being a group of covers of the same song). Song lengths range from 0.5 to 8 minutes, with an average duration of 3.6 minutes.
Following the evaluation strategy in [2], the collection is divided into two subsets: a first one containing only canonical songs (footnote 4) from the corpus (481 elements) and a second one comprising the entire music collection (2125 elements).

The freely available query corpus (footnote 5) comprises a total of 118 queries recorded by 17 users (9 female and 8 male) whose musical knowledge ranged from none to amateur musician, with an average of 6.8 queries per user (1 as a minimum and 11 as a maximum). As reference songs, users chose among the 481 songs in the canonical subset of the music collection. Queries range from 11 to 98 seconds, with an average length of 28.6 seconds.

6.2 Measures

Generally, a QBH system is assessed using rank metrics, as its output is a list of candidate melodies sorted by their similarity score with respect to the query. In these terms, the two most common evaluation measures are the Mean Reciprocal Rank (MRR) and the Top-X Hit Rate.

6.2.1 Mean Reciprocal Rank (MRR)

For a given user query $Q$, corresponding to a target song $A$, the system returns a sorted list in which song $A$ is located at position (or rank) $r$. The Reciprocal Rank (RR) for $A$ is defined as $1/r$. Generalizing to a series of $N$ queries, the Mean Reciprocal Rank (MRR) is defined as:

$$MRR = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{r(Q_i)} \qquad (4)$$

Scores obtained fall in the range $0 \le MRR \le 1$, where 0 stands for the worst case and 1 for the best. For any of the evaluation sets considered, $r$ is taken to be the rank of the highest-ranked version matching query $Q$.

6.2.2 Top-X Hit Rate

Given the resulting rank, this measure checks whether the position $r$ of the element matching $Q$ is among the first $X$ positions of the list, i.e. $r(Q_i) \le X$. This estimates the frequency of retrieving the correct result among the first $X$ positions [2]. As in the previous case, the highest-ranked version which matches query $Q$ is considered as $r$.

Footnote 4: The songs as published by the band who composed/played them.

Footnote 5: http://mtg.upf.edu/download/datasets/mtg-qbh
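Both measures follow directly from the list of ranks $r(Q_i)$, one per query. The following is a minimal sketch with hypothetical function names; it assumes each rank is a 1-based integer already reduced to the highest-ranked matching version.

```python
def mean_reciprocal_rank(ranks):
    """Equation (4): average of 1 / r(Q_i) over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def top_x_hit_rate(ranks, x):
    """Percentage of queries whose target appears within the first x positions."""
    hits = sum(1 for r in ranks if r <= x)
    return 100.0 * hits / len(ranks)
```

For example, the ranks `[1, 2, 4, 10]` give an MRR of 0.4625 and a Top-3 Hit Rate of 50 %. With the 118 queries of the corpus, one additional hit changes a Hit Rate by 100/118, roughly 0.85 percentage points.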

7. RESULTS AND DISCUSSION

7.1 Results

Results obtained for the abstractions and alignment algorithms considered are presented in Table 3. Due to space requirements, only the best result obtained for each configuration is reported. In order to assess these results consistently, a baseline configuration has been added: for each query, the candidates' rank is randomly sorted (without computing any similarity measure) and the evaluation figures are then obtained; the results shown for this configuration are the average of 10,000 repetitions. Results from [2] are also included for a comparative assessment.

We note that all the proposed configurations significantly outperform the MRR figure of 0.014 obtained with the considered baseline. However, the results are still considerably lower than the ones obtained in [2]. Nevertheless, the differences in performance among the different configurations allow us to make some interesting observations.

We see that the combination of SAX with the Smith-Waterman alignment obtains an MRR of 0.05 when evaluated against the canonical (481 songs) test set. The semitone quantization step of PAA ST, which constitutes the only difference with respect to the SAX abstraction process, does not significantly affect the results (the MRR score is now around 0.04). This is worth remarking since, although the abstraction is more closely related to an actual music representation, the accuracy scores obtained are similar to the ones obtained with SAX.

PC ST assesses the influence of note segmentation in the process. Focusing on the canonical set and the Smith-Waterman alignment, this particular encoding methodology achieves an MRR score around 0.09, thus outperforming the two other abstractions. This suggests that musically-informed temporal segmentation of pitch sequences may benefit the performance of the system.
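Since the figures discussed so far come from Smith-Waterman scores, it may help to see how such a score is obtained from the four costs. The sketch below assumes the standard local-alignment recurrence; the function name is hypothetical, the default weights mirror configuration T1 in Table 2, and which gap direction is called an insertion or a deletion is an assumption. The exact implementation used in the experiments is not specified in the text.

```python
def smith_waterman_score(query, candidate, match=1.0, mismatch=-0.5,
                         insertion=-0.5, deletion=-0.5):
    """Best local alignment score between two symbol sequences."""
    cols = len(candidate) + 1
    prev = [0.0] * cols  # previous row of the dynamic programming matrix
    best = 0.0
    for i in range(1, len(query) + 1):
        curr = [0.0] * cols
        for j in range(1, cols):
            sub = match if query[i - 1] == candidate[j - 1] else mismatch
            curr[j] = max(
                0.0,                     # start a new local alignment
                prev[j - 1] + sub,       # match or mismatch
                prev[j] + deletion,      # query symbol left unmatched
                curr[j - 1] + insertion  # candidate symbol left unmatched
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

Ranking the collection then amounts to computing this score between the coded query and every coded candidate and sorting the candidates by decreasing score.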
As expected from [2], the inclusion of cover songs among the candidate set enhances retrieval accuracy for our configurations, except for PAA ST: while for both SAX and PC ST there is an improvement of 0.05 in the MRR measure, results for PAA ST do not vary significantly in comparison with the canonical set.

Results obtained for the Top-X Hit Rate measure also support our observation that a proper temporal segmentation is beneficial for the system. When only considering the canonical set, the correct candidate is retrieved in the first position around 3 % and 1 % of the time for SAX and PAA ST respectively while, for PC ST, this figure is close to 6 %. The same conclusion can be drawn from the rest of the Hit Rates (3, 5 and 10) as well as from the inclusion of covers among the candidates.

Focusing on the alignment algorithms, although the different proposed Smith-Waterman configurations show some influence on the overall accuracy, no configuration clearly outperforms the others in all cases. Results obtained with Subsequence Dynamic Time Warping show lower performance than those obtained with Smith-Waterman. This may be improved with the use of more complex cost functions than the Edit distance considered.

7.2 Discussion

While the SAX abstraction has been shown to perform successfully for a variety of time-series tasks [11], results in the experiments carried out suggest that this is not the case for musical time-series data in the context of QBH. The most likely reason is that SAX does not consider any particularities that the domain of origin of the time series may have. Thus, in the case of QBH, SAX may be abstracting away musically-related information from the melodic contours that is required for properly performing the alignment. This idea is further supported by the improvement in the results when using the PC ST abstraction as, although in a very naïve way, it tries to segment the different musical notes present in the contour.
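To make the pitch change segmentation concrete, the PC ST steps of Section 5.3 can be sketched as follows. This is a simplified illustration, not the authors' code: the function names are hypothetical, filter windows are centred and shrink at the contour edges, `median_low` is used so that filtered values stay on the semitone grid, and the output is one absolute semitone value per segment.

```python
from statistics import mean, median_low


def sliding(values, width, reducer):
    """Apply reducer over a centred window of about width samples.

    The window covers 2 * (width // 2) + 1 samples and shrinks at the edges.
    """
    half = width // 2
    return [reducer(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]


def pc_st_encode(f0_cents, smooth_len=25, glitch_len=25):
    """Smooth, quantize to semitones, remove glitches, split on pitch changes."""
    if not f0_cents:
        return []
    smoothed = sliding(f0_cents, smooth_len, mean)         # step (a)
    semitones = [round(c / 100.0) for c in smoothed]       # semitone grid
    cleaned = sliding(semitones, glitch_len, median_low)   # step (b)
    notes = [cleaned[0]]
    for s in cleaned[1:]:
        if s != notes[-1]:  # a pitch change event opens a new segment
            notes.append(s)
    return notes
```

On a toy contour with a one-frame spike, the spike survives as a spurious segment when the median filter is disabled and disappears once the filter is longer than the smeared glitch, which is the behaviour the softening process is meant to provide.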
The results obtained with the two modifications proposed support the relevance of using musically-informed temporal segmentation of the contour. In this study, the use of a basic temporal segmentation based on pitch change events leads to accuracy improvements when compared to the use of the PAA dimensionality reduction algorithm. The most likely reason is again that the PAA algorithm does not take into account the musical nature of the data to encode, thus abstracting away relevant information necessary for the alignment. In these terms, the use of more sophisticated temporal segmentation techniques for music data, such as onset detection, could improve these results.

Although the abstractions studied in this paper are not competitive in terms of a practical QBH system, evidence from previous work (cf. [2]) shows that non-transcription abstractions may lead to successful results. These results encourage the exploration of other abstractions to provide competent alternatives to transcription-based QBH systems.

8. CONCLUSIONS

Query-by-Humming (QBH) systems constitute a particular type of music search engine in which the query is a sung or hummed excerpt of the main melody of a song. Most often, these schemes rely on both existing music annotations and fully-automated music transcription algorithms for computing the melodic similarity. Although many examples of QBH systems have been proposed under this premise, its limited scalability, together with the fact that no full automatic transcription algorithm is error-free, clearly limits their performance in practical situations.

In this work we assessed the influence of this particular step in such systems by using three melody encoding alternatives which avoid full music transcription. More precisely, starting from the general time-series encoding method Symbolic Aggregate Approximation (SAX), we modify this algorithm by incorporating music-based pitch quantization and segmentation to evaluate their influence in the context of a QBH system.
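| Approach | Evaluation subset | Alignment configuration | MRR | Top-1 (%) | Top-3 (%) | Top-5 (%) | Top-10 (%) |
|---|---|---|---|---|---|---|---|
| SAX | Canonical | T1 | 0.0500 | 2.54 | 5.93 | 7.63 | 9.32 |
| SAX | Canonical | T2 | 0.0566 | 2.54 | 5.93 | 5.93 | 11.02 |
| SAX | Canonical | T3 | 0.0632 | 4.24 | 5.93 | 5.93 | 9.32 |
| SAX | Canonical | T4 | 0.0472 | 3.39 | 4.24 | 5.08 | 6.78 |
| SAX | Canonical | S-DTW ED | 0.0333 | 1.69 | 3.39 | 3.39 | 8.47 |
| SAX | Complete | T1 | 0.1117 | 7.63 | 11.86 | 12.71 | 17.80 |
| SAX | Complete | T2 | 0.1155 | 7.63 | 11.86 | 12.71 | 17.80 |
| SAX | Complete | T3 | 0.0962 | 5.08 | 10.17 | 11.86 | 14.41 |
| SAX | Complete | T4 | 0.0849 | 5.08 | 8.47 | 11.02 | 12.71 |
| SAX | Complete | S-DTW ED | 0.0443 | 2.54 | 4.24 | 5.08 | 8.47 |
| PAA ST | Canonical | T1 | 0.0515 | 2.54 | 4.24 | 6.78 | 11.02 |
| PAA ST | Canonical | T2 | 0.0421 | 1.69 | 3.39 | 4.24 | 9.32 |
| PAA ST | Canonical | T3 | 0.0391 | 1.69 | 2.54 | 4.24 | 6.78 |
| PAA ST | Canonical | T4 | 0.0424 | 1.69 | 4.24 | 4.24 | 5.93 |
| PAA ST | Canonical | S-DTW ED | 0.0346 | 1.69 | 2.54 | 3.39 | 5.93 |
| PAA ST | Complete | T1 | 0.0396 | 1.69 | 2.54 | 5.93 | 9.32 |
| PAA ST | Complete | T2 | 0.0424 | 1.69 | 3.39 | 4.24 | 8.47 |
| PAA ST | Complete | T3 | 0.0406 | 1.69 | 3.39 | 5.08 | 8.47 |
| PAA ST | Complete | T4 | 0.0558 | 3.39 | 5.08 | 5.93 | 9.32 |
| PAA ST | Complete | S-DTW ED | 0.0334 | 1.69 | 2.54 | 6.78 | 9.32 |
| PC ST | Canonical | T1 | 0.0894 | 5.93 | 9.32 | 10.17 | 12.71 |
| PC ST | Canonical | T2 | 0.0967 | 6.78 | 11.86 | 12.71 | 15.25 |
| PC ST | Canonical | T3 | 0.0957 | 6.78 | 8.47 | 12.71 | 14.41 |
| PC ST | Canonical | T4 | 0.0772 | 5.08 | 6.78 | 8.47 | 12.71 |
| PC ST | Canonical | S-DTW ED | 0.0165 | 0.00 | 0.85 | 1.69 | 4.24 |
| PC ST | Complete | T1 | 0.1447 | 10.17 | 14.41 | 17.80 | 24.58 |
| PC ST | Complete | T2 | 0.1460 | 10.17 | 16.95 | 19.49 | 22.88 |
| PC ST | Complete | T3 | 0.1563 | 11.02 | 16.95 | 17.80 | 22.88 |
| PC ST | Complete | T4 | 0.1447 | 10.17 | 14.41 | 17.80 | 24.58 |
| PC ST | Complete | S-DTW ED | 0.0181 | 0.00 | 0.85 | 0.85 | 3.39 |
| Baseline | Canonical | Random | 0.0140 | 0.21 | 0.62 | 1.03 | 2.06 |
| Baseline | Complete | Random | 0.0039 | 0.05 | 0.15 | 0.25 | 0.50 |
| Salamon [2] | Canonical | Qmax | 0.45 | 40.68 | 47.46 | 49.15 | 51.69 |
| Salamon [2] | Complete | Qmax | 0.56 | 50.85 | 58.47 | 61.02 | 66.10 |

Table 3. MRR and Top-X Hit Rate results obtained for the proposed experimentation. Figures represent the best score achieved in each particular abstraction configuration.

Results obtained suggest that the time-series representation algorithm SAX does not seem to be suitable for melody alignment in the context of Query by Humming. In this sense, the main outcome of this study is that, given the complexity of Query by Humming, musically-related abstractions should be considered for encoding the contours. Future work will consider the incorporation of the conclusions obtained in this work to the abstraction proposed in [2]: as the abstraction in the cited work performs a chromagram representation with a fixed-time temporal segmentation, the incorporation of dynamically-based segmentation could improve the results obtained.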
Moreover, given the relevance of the user in this particular task, interactive pattern recognition paradigms for addressing the similarity step could be considered: when a query is incorrectly answered, the system could modify the dissimilarity measure (metric learning) to incorporate the user's feedback.

Acknowledgments

This research work has been partially supported by Consejería de Educación de la Comunitat Valenciana through project PROMETEO/2012/017, Vicerrectorado de Investigación, Desarrollo e Innovación de la Universidad de Alicante through FPU programme (UAFPU2014-5883), the Spanish Ministerio de Economía y Competitividad through project TIMuL (No. TIN2013-48152-C2-1-R, supported by EU FEDER funds) and the Spanish entity Fundación Obra Social "la Caixa". Authors would also like to thank José M. Iñesta for kindly proofreading this paper.

9. REFERENCES

[1] A. Duda, A. Nürnberger, and S. Stober, "Towards Query by Singing/Humming on Audio Databases," in Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR), Austria, 2007, pp. 331-334.
[2] J. Salamon, J. Serrà, and E. Gómez, "Tonal Representations for Music Retrieval: From Version Identification to Query-by-Humming," International Journal of Multimedia Information Retrieval, special issue on Hybrid Music Information Retrieval, vol. 2, no. 1, pp. 45-58, 2013.
[3] R. B. Dannenberg, W. P. Birmingham, B. Pardo, N. Hu, C. Meek, and G. Tzanetakis, "A Comparative Evaluation of Search Techniques for Query-by-humming Using the MUSART Testbed," Journal of the American Society for Information Science and Technology, vol. 58, no. 5, pp. 687-701, 2007.
[4] D. Little, D. Raffensperger, and B. Pardo, "A Query by Humming System that Learns from Experience," in Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR), Austria, 2007, pp. 335-338.
[5] A. Ito, Y. Kosugi, S. Makino, and M. Ito, "A query-by-humming music information retrieval from audio signals based on multiple F0 candidates," in Proceedings of the International Conference on Audio Language and Image Processing (ICALIP), China, 2010, pp. 1-5.
[6] M. Ryynänen and A. Klapuri, "Query by humming of midi and audio using locality sensitive hashing," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), USA, 2008, pp. 2249-2252.
[7] M. Rocamora, P. Cancela, and A. Pardo, "Query by humming: Automatically building the database from music recordings," Pattern Recognition Letters, vol. 36, no. 1, pp. 272-280, 2014.
[8] J. Salamon, E. Gómez, D. P. Ellis, and G. Richard, "Melody Extraction from Polyphonic Music Signals: Approaches, applications, and challenges," IEEE Signal Processing Magazine, vol. 31, no. 2, pp. 118-134, 2014.
[9] R. Typke, "Music retrieval based on melodic similarity," Ph.D. dissertation, Utrecht University, Netherlands, February 2007.
[10] E. Benetos, S. Dixon, D. Giannoulis, H. Kirchhoff, and A. Klapuri, "Automatic music transcription: challenges and future directions," J. Intell. Inf. Syst., vol. 41, no. 3, pp. 407-434, 2013.
[11] J. Lin, E. Keogh, L. Wei, and S. Lonardi, "Experiencing SAX: A Novel Symbolic Representation of Time Series," Data Mining and Knowledge Discovery, vol. 15, no. 2, pp. 107-144, 2007.
[12] J. Salamon and E. Gómez, "Melody Extraction from Polyphonic Music Signals using Pitch Contour Characteristics," IEEE Transactions on Audio, Speech and Language Processing, vol. 20, no. 6, pp. 1759-1770, 2012.
[13] T. Smith and M. Waterman, "Identification of common molecular subsequences," Journal of Molecular Biology, vol. 147, no. 1, pp. 195-197, 1981.
[14] M. Müller, Information retrieval for music and motion. Springer, 2007.
[15] A. Ghias, J. Logan, D. Chamberlin, and B. C. Smith, "Query by Humming: Musical Information Retrieval in an Audio Database," in Proceedings of the 3rd ACM International Conference on Multimedia, USA, 1995, pp. 213-236.
[16] W. Jeon, C. Ma, and Y. M. Cheng, "An Efficient Signal-Matching Approach to Melody Indexing and Search Using Continuous Pitch Contours and Wavelets," in Proceedings of the 10th International Society for Music Information Retrieval Conference (ISMIR), Japan, 2009, pp. 681-686.
[17] R. A. Wagner and M. J. Fischer, "The String-to-String Correction Problem," Journal of the Association for Computing Machinery, vol. 21, no. 1, pp. 168-173, 1974.
[18] M. Goto, "A real-time music-scene-description system: predominant-F0 estimation for detecting melody and bass lines in real-world audio signals," Speech Communication, vol. 43, no. 4, pp. 311-329, 2004.
[19] J. Serrà, H. Kantz, X. Serra, and R. G. Andrzejak, "Predictability of Music Descriptor Time Series and its Application to Cover Song Detection," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 2, pp. 514-525, 2012.
[20] X. Anguera and M. Ferrarons, "Memory efficient subsequence DTW for Query-by-Example Spoken Term Detection," in IEEE International Conference on Multimedia and Expo (ICME), USA, 2013, pp. 1-6.
[21] T. H. Özaslan and J. L. Arcos, "Legato and Glissando identification in Classical Guitar," in Proceedings of the 7th Sound and Music Computing Conference (SMC), Spain, 2010, pp. 457-463.
[22] S. Zhang, R. C. Repetto, and X. Serra, "Study of the Similarity Between Linguistic Tones and Melodic Pitch Contours in Beijing Opera Singing," in Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR), Taiwan, 2014, pp. 343-348.