A Music Search Engine based on Semantic Text-Based Query

Michele Buccoli (1), Massimiliano Zanoni (2), Augusto Sarti (2), Stefano Tubaro (2)
Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, piazza Leonardo da Vinci 32, 20133 Milano, Italy
(1) michele.buccoli@mail.polimi.it  (2) {zanoni, sarti, tubaro}@elet.polimi.it

MMSP'13, Sept. 30 - Oct. 2, 2013, Pula (Sardinia), Italy. 978-1-4799-0125-8/13/$31.00 (c) 2013 IEEE.

Abstract - Search and retrieval of songs from a large music repository usually relies on added meta-information (e.g., title, artist or musical genre), on specific descriptors (e.g., mood), or on categorical music descriptors, none of which can specify the desired intensity of a descriptor. In this work we propose an early example of a semantic text-based music search engine. The semantic description takes into account emotional and non-emotional musical aspects. The method also includes a query-by-similarity search approach performed using semantic cues. We model both concepts and musical content in dimensional spaces that are suitable for carrying intensity information on the descriptors. We process the semantic query with a Natural Language parser to capture only the relevant words and qualifiers, and we rely on Bayesian Decision Theory to model concepts and songs as probability distributions. The resulting ranked list of songs is produced through a posterior probability model. A prototype of the system was evaluated by 53 subjects, with good ratings on performance, usefulness and potential.

I. INTRODUCTION

In the past decade the way we retrieve, organize, browse and listen to music has undergone a deep transformation. Commercial solutions where the desired musical content is just one click away have multiplied. This enables new scenarios for music diffusion, fruition and democratization, and paves the way to new forms of social networks for music: Soundcloud (https://soundcloud.com), Spotify (https://www.spotify.com/), etc. However, with such a formidable amount of content at hand, the user is exposed to the risk of information overload, which leads to the paradoxical situation of freedom and ease of access becoming an obstacle to finding what we are looking for.

Until not so long ago, the role of mediator between the user and the content was played by vendors, promoters (magazines, radio stations, etc.) and geographical constraints. They all contributed to creating a perspective, a hierarchy of importance on musical content. In the current scenario users have direct access to content, thus weakening the role of mediators and, as a consequence, flattening our perspective. This problem was understood quite early by scientific communities (particularly that of Music Information Retrieval, MIR), which focused from the start on the creation of novel mediators. Despite the availability of solutions for content-based music information retrieval, music collections are usually still managed and accessed through techniques that use meta-information such as artist name, title, etc. (e.g., Last.fm, http://www.last.fm [1], and iTunes, http://www.apple.com/itunes/). There are also solutions that enable the search for meta descriptors starting from partial content descriptors; these fall within the area of music search, within MIR. Shazam (http://www.shazam.com), for example, is an application that retrieves songs from a recorded sample [2], while Soundhound (http://www.soundhound.com) performs music search and retrieval with a query-by-humming paradigm.
Recent studies [3], however, have pointed out the emerging and growing need of users to interact with systems at a higher level of abstraction. In [3] the author proposes an audio fragment search engine based on semantic descriptors and on an acoustic feature space. In [4] the authors introduce a music search system with query by semantic description based on a vocabulary of 159 descriptors related to emotion, genre, usage, etc. Stereomood (http://www.stereomood.com) is a web service that retrieves music exhibiting a desired mood. These tools introduce a semantic description of music content, but the paradigm still relies on the categorical approach, which cannot express how well a concept describes an excerpt. This precludes the use of qualifiers such as "not so happy", "very aggressive", etc. Moreover, queries are generally expressed in a pre-structured form.

In this work we address music search based on query by semantic description. We refer to this paradigm as a semantic text-based search engine. The paradigm uses natural language queries to exploit the richness of language and capture the significant concepts and qualifiers.

Natural Language Processing (NLP) [5] is the discipline concerned with making machines able to understand human natural language.

In the literature, semantic descriptors are classified as Emotional Descriptors (ED) and Non-Emotional Descriptors (NED). We use a dimensional approach for both EDs and NEDs. The dimensional approach to emotion conceptualization aims at mapping emotions onto 2D or 3D spaces. The most widespread choice is the Valence-Arousal (VA) space [6], where each affective term is represented as a point. While a relation between affective terms is rather apparent for EDs, we cannot state the same for NEDs: no space able to represent such descriptors has been proposed, which is why we model each NED in a 1D semantic space.

Query-by-similarity is based on a different paradigm for music description. iTunes Genius [7] suggests music that appears to be similar to the user's collection, though it does so based on meta-data descriptors. In [8] and in the Spotify application, query by similarity is performed using both acoustic descriptors and meta information. In our paradigm the similarity is computed in the ED and NED semantic spaces. In order to compare different types of descriptions and queries, we propose a combined dimensional space that includes both approaches. Using Bayesian theory on the semantic space, we model EDs, NEDs and songs as a-priori probability distributions, and the final score (used for ranking the playlist) as a posterior probability. We implemented a prototype of the system as a web search engine that returns a ranked list of songs.

II. APPROACH OVERVIEW

In the semantic text-based search engine there are three key aspects to consider: the modeling of concepts, the modeling of music content, and the modeling of queries. The model we propose in this paper relies on Bayesian Decision Theory. Once the query is modeled, the posterior probability is computed for each song to produce the final ranked list. The general scheme of the approach is shown in Fig. 1.

Fig. 1. System block diagram with a detailed view of the computational core.

A. Dimensional Approach to Concept Modeling

In this paper we model both Emotional and Non-Emotional Descriptors.

1) Emotional Descriptors: One of the key properties of music is its ability to convey emotions. This is the main reason that pushed psychologists and musicologists to investigate paradigms for the representation of emotions. The most influential dimensional music emotion conceptualization model so far is the circumplex model of affect proposed by Russell [6] (Fig. 2). This model consists of a two-dimensional space composed of Arousal (A), linked to the degree of activation or excitement, and Valence (V), linked to the degree of pleasantness. Distances between points in the space are proportional to the semantic distances between words.

Fig. 2. Russell's circumplex model of affect. Affective terms are described in terms of Arousal (degree of activation) and Valence (degree of pleasantness).

In [9] the authors collected a set of English affective words (ANEW), manually tagged. The means and standard deviations of the Valence and Arousal annotations related to each concept are computed as a measure of the consensus among annotators. We exploit this consensus to model each term in the ANEW collection as a normal probability distribution.
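Before the formalization in Eq. (1) below, a rough illustration of this idea (an assumption for explanatory purposes, not the authors' code) is to treat each ANEW term as a bivariate Gaussian over the VA plane and query its density at any VA point; scipy and the numeric values are placeholders.

```python
# Sketch (assumption): an ANEW affective term as a 2D Gaussian in the
# Valence-Arousal plane, built from its annotation mean and standard deviation.
# The renormalization over the bounded VA support described in the text is omitted.
import numpy as np
from scipy.stats import multivariate_normal

class AffectiveTerm:
    def __init__(self, name, mu_valence, mu_arousal, sigma_valence, sigma_arousal):
        self.name = name
        # Diagonal covariance, as in Eq. (1): Sigma = diag(sigma_V, sigma_A)
        self.dist = multivariate_normal(
            mean=[mu_valence, mu_arousal],
            cov=np.diag([sigma_valence**2, sigma_arousal**2]))

    def evaluate(self, valence, arousal):
        """Density of the concept at a point of the VA plane."""
        return self.dist.pdf([valence, arousal])

# Hypothetical ANEW-style values (illustrative only, not taken from the dataset)
happy = AffectiveTerm("happy", mu_valence=0.9, mu_arousal=0.7,
                      sigma_valence=0.15, sigma_arousal=0.2)
print(happy.evaluate(0.8, 0.65))   # high density: close to the concept
print(happy.evaluate(0.1, 0.2))    # low density: far from the concept
```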
Given an affective term $w_{ED} \in W_{ED}$, with $W_{ED}$ the set of affective terms,

$w_{ED}(n_{VA}) \sim \mathcal{N}(\mu_{VA}^{w}, \Sigma_{VA}^{w})$,   (1)

where $\mathcal{N}(\cdot)$ denotes a normal distribution, $n_{VA} = [n_V, n_A]^T$ represents a point in the Valence-Arousal plane, $\mu_{VA}^{w} = [\mu_V^{w}, \mu_A^{w}]^T$ are the mean values of Valence and Arousal from the ANEW dataset, and $\Sigma_{VA}^{w} = \mathrm{diag}(\sigma_{VA}^{w}) = \mathrm{diag}([\sigma_V^{w}, \sigma_A^{w}]^T)$ is the covariance matrix. Since the support of the probability distribution is limited, the distributions $w_{ED}$ are normalized so that they integrate to 1.

2) Non-Emotional Descriptors: In [10] the authors proposed 27 semantic descriptors divided into affective/emotive, structural, kinaesthetic and judgement categories. For our study we chose to model some of the structural and judgement bipolar descriptors and one kinaesthetic descriptor, as shown in Table I. Each concept is modeled independently. Since the definition of gesture is equivalent to the definition of grooviness proposed in [11], in this study we prefer the term groovy, which is more widely used in a musical context.

TABLE I
LIST OF NON-EMOTIONAL DESCRIPTORS CHOSEN FROM THE ANNOTATION EXPERIMENT IN [10]
  Structural:   Soft/Hard, Clear/Dull, Rough/Harmonious, Void/Compact, Flowing/Stuttering, Dynamic/Static
  Kinaesthetic: Gesture
  Judgement:    Easy/Difficult

Moreover, given the importance of rhythmic information, we include the bipolar Tempo descriptor Fast/Slow. Tempo is a general description of the speed of a song and is generally expressed in beats per minute (BPM). Non-Emotional Descriptors are modeled using normal distributions such as

$w_d^{+}(n) \sim \mathcal{N}(\mu_d^{+}, \sigma_d)$ and $w_d^{-}(n) \sim \mathcal{N}(\mu_d^{-}, \sigma_d)$,   (2)

where $w_d^{+}$ and $w_d^{-}$ are the first and the second term, respectively, of the bipolar descriptor $d$, $n \in [0, 1]$; $\mu_d^{+} = 0$ and $\mu_d^{-} = 1$ are the mean values at the left and right bounds of $n$; and $\sigma_d$ is the standard deviation, set to 0.5, as this is the value that splits the space into the two opposite terms. Concerning the Fast/Slow descriptor, the Tempo value in BPM is normalized in the range $[0, 1]$. As Grooviness is not expressed as a bipolar concept, we formalize it as $w_d(n) \sim \mathcal{N}(1, 0.5)$ with $n \in [0, 1]$. All the distributions are finally normalized to their maximum value.

In the context of music transcription, a common way to express tempo is through Tempo Markings, which indicate the pace of the piece. We include Tempo Markings in the search engine and we model them by exploiting their correlation with BPM ranges, as proposed in [12] (Table II).

TABLE II
TEMPO MARKINGS (TM) AND CORRESPONDING RANGES OF BPM
  Adagio: 66-76   Andante: 77-108   Moderato: 109-120   Allegro: 121-168   Presto: 169-200

To capture possible Tempo fluctuations in the song, we model the Tempo Markings partially as normal distributions and partially as uniform distributions, as shown in Fig. 3. In particular, we fix the standard deviation as

$\sigma_T^{w} = \beta\,(T_2 - T_1)$,   (3)

where $T_1$ and $T_2$ are the bounds of the BPM range of the word $w$, with $w \in \{adagio, andante, moderato, allegro, presto\}$, and $\beta$ is experimentally determined as $\beta = 0.25$. The model is formalized as

$w_T(n) \sim \mathcal{N}(T_1, \sigma_T^{w})$ if $n \le T_1$;  $w_T(n) = 1$ if $n \in (T_1, T_2)$;  $w_T(n) \sim \mathcal{N}(T_2, \sigma_T^{w})$ if $n \ge T_2$.   (4)

The normal distributions are normalized such that

$w_T(T_1) = w_T(T_2) = 1$.   (5)

Fig. 3. Concept modeling for the tempo marking words listed in Table II. The x-axis shows BPM; the y-axis shows the tempo markings modeled in the mono-dimensional BPM space.

B. Music Content Semantic Description Modeling

We model songs in both the ED and NED spaces in order to compare them to concepts.

1) Emotional Descriptors: Songs are manually annotated in the VA space. For each song, the mean $\mu_{VA}^{s}$ and standard deviation $\sigma_{VA}^{s}$ of the annotations are computed. As for concepts, in order to account for the consensus and the variation of the annotations, we model songs as normal distributions in the VA plane:

$s_{ED}(n_{VA}) \sim \mathcal{N}(\mu_{VA}^{s}, \Sigma_{VA}^{s})$,   (6)

where $\Sigma_{VA}^{s}$ is the covariance matrix. Since the support of the probability distribution is limited, the distributions $s_{ED}$ are normalized so that they integrate to 1.

2) Non-Emotional Descriptors: We compute the mean $\mu_d^{s}$ and standard deviation $\sigma_d^{s}$ for each descriptor and we model the songs in the dataset as normal distributions in a mono-dimensional space:

$s_d(n) \sim \mathcal{N}(\mu_d^{s}, \sigma_d^{s})$,   (7)

where $n$ is a point in the space and $d \in D = \{hard, clear, rough, comp, dyn, stutt, diff, groovy, BPM\}$.
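Looking back at the tempo-marking model of Eqs. (3)-(5) above, a minimal numeric sketch (an illustration under assumptions, not the authors' implementation) could look as follows: the score is flat and equal to 1 inside the BPM range of a marking, and falls off as a Gaussian tail outside it.

```python
# Sketch (assumption): tempo-marking distribution of Eqs. (3)-(5),
# flat (value 1) inside the BPM range and Gaussian-shaped outside it,
# with the tails equal to 1 at T1 and T2 as required by Eq. (5).
import numpy as np

TEMPO_MARKINGS = {  # BPM ranges from Table II
    "adagio": (66, 76), "andante": (77, 108), "moderato": (109, 120),
    "allegro": (121, 168), "presto": (169, 200),
}

def tempo_marking_model(word, bpm, beta=0.25):
    """Un-normalized score w_T(n) of a BPM value for a tempo marking."""
    t1, t2 = TEMPO_MARKINGS[word]
    sigma = beta * (t2 - t1)                      # Eq. (3)
    if t1 < bpm < t2:                             # inside the range: uniform part
        return 1.0
    center = t1 if bpm <= t1 else t2              # outside: Gaussian tail, Eq. (4)
    return np.exp(-0.5 * ((bpm - center) / sigma) ** 2)  # equals 1 at T1/T2, Eq. (5)

print(tempo_marking_model("allegro", 140))   # 1.0, inside the Allegro range
print(tempo_marking_model("allegro", 180))   # < 1, fades toward Presto
```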
C. Query Modeling

In our paradigm, queries are expressed as sentences in natural English based on: EDs and NEDs; qualifiers; similarity with a song in the database. In the latter case it is also possible to use the relative qualifiers less and more. Sentences are parsed using NLP techniques [5] to extract keywords. The output of the semantic parser is a semantic tree. We use part-of-speech (POS) tagging to analyze only adjectives, foreign words (for the Italian tempo markings) and qualifiers. Once a word $w$ is found to be relevant, the semantic tree is parsed to capture its qualifiers $\psi_w$, if any. Qualifiers are used to alter the probability distribution of the related concept, using a rescaled version of the mapping to an 11-point scale proposed in [13] (Table III).
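As an illustration only: the paper relies on the Stanford parser and a semantic tree, but a simplified stand-in using NLTK's POS tagger (an assumption, single-word qualifiers only) conveys the idea of keeping just adjectives, foreign words and their qualifiers.

```python
# Sketch (assumption): extracting candidate descriptors (adjectives / foreign words)
# and the single-word qualifier preceding them from a free-text query.
# The paper uses the Stanford parser; nltk.pos_tag is only a simplified stand-in.
# Requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
import nltk

QUALIFIERS = {"not", "hardly", "slightly", "somewhat", "rather", "quite",
              "very", "highly", "extremely", "completely"}  # subset of Table III

def parse_query(query):
    """Return a list of (descriptor, qualifier-or-None) pairs."""
    tokens = nltk.word_tokenize(query.lower())
    tagged = nltk.pos_tag(tokens)
    results = []
    for i, (word, tag) in enumerate(tagged):
        if tag in ("JJ", "FW"):                    # adjectives and foreign words
            qualifier = tokens[i - 1] if i > 0 and tokens[i - 1] in QUALIFIERS else None
            results.append((word, qualifier))
    return results

print(parse_query("I want a very groovy and happy song"))
# e.g. [('groovy', 'very'), ('happy', None)]
```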

TABLE III
VERBAL LABELS AND CORRESPONDING MEAN VALUES FROM [13]
  not at all: 0.0   not: 0.4   hardly: 1.5   a little: 2.5   slightly: 2.5   partly: 3.5
  somewhat: 4.5   in-between: 4.8   average: 4.8   medium: 4.9   moderately: 5.0   fairly: 5.3
  rather: 5.8   quite: 5.9   quite a bit: 6.5   mainly: 6.8   considerably: 7.6   very: 7.9
  highly: 8.6   very much: 8.7   fully: 9.4   extremely: 9.6   completely: 9.8

Concerning the application of qualifiers to NEDs, the alteration affects the concept as follows:

$\psi_w \in [0, 2.5]$: the semantically opposite descriptor is considered and more evidence is assigned to values closer to the opposite bound:
$w_d'(n) \sim \mathcal{N}(1 - \mu_d^{w}, \alpha\,\sigma_d^{w})$, $\alpha \in [0.5, 1]$,   (8)
where $\alpha$ is a scale factor for the standard deviation, directly proportional to $\psi_w$;

$\psi_w \in [2.5, 5]$: the new concept is modeled as a normal distribution centered on the rescaled version of $\psi_w$:
$w_d'(n) \sim \mathcal{N}(\mu_d'^{w}, \sigma_d'^{w})$,   (9)
where $\mu_d'^{w} = 0.1\,\psi_w$ and $\sigma_d'^{w}$ is fixed and experimentally set to 0.2;

$\psi_w \in [5, 7.5]$: same as for $\psi_w \in [2.5, 5]$;

$\psi_w \in [7.5, 10]$: more evidence is assigned to values closer to the bound:
$w_d'(n) \sim \mathcal{N}(\mu_d^{w}, \alpha\,\sigma_d^{w})$, $\alpha \in [0.5, 1]$,   (10)
where $\alpha$ is inversely proportional to $\psi_w$.

An example of the four categories of qualifiers applied to the concept groovy is shown in Fig. 4.

Fig. 4. Example of the application of qualifiers to the concept groovy. All four categories of qualifiers are shown: $\psi_w \in [7.5, 10]$, $\psi_w \in [5, 7.5]$, $\psi_w \in [2.5, 5]$, $\psi_w \in [0, 2.5]$.

With respect to the application of qualifiers to EDs, the alteration is modeled as follows:

$\psi_w \in [0, 2.5]$: the concept at the antipodes of the VA plane is considered, focused on its mean:
$w_{ED}'(n_{VA}) \sim \mathcal{N}(\mathbf{1} - \mu_{VA}^{w}, \alpha\,\Sigma_{VA}^{w})$, $\alpha \in [0.5, 1]$,   (11)
where $\mathbf{1} = [1, 1]^T$ and $\alpha$ is a scale factor directly proportional to $\psi_w$;

$\psi_w \in [2.5, 5]$: the final concept should not be the original one, but a conceptually similar one; for this reason a ring around the distribution of the original concept is generated:
$w_{ED}'(n_{VA}) \sim \mathcal{N}(\mu_{VA}^{w}, \alpha\,\Sigma_{VA}^{w}) - w_{ED}(n_{VA})$, $\alpha \in [1.5, 3]$,   (12)
where $\alpha$ is inversely proportional to $\psi_w$;

$\psi_w \in [5, 7.5]$: the distribution is relaxed:
$w_{ED}'(n_{VA}) \sim \mathcal{N}(\mu_{VA}^{w}, \alpha\,\Sigma_{VA}^{w})$, $\alpha \in [1.5, 3]$,   (13)
where $\alpha$ is a scale factor directly proportional to $\psi_w$;

$\psi_w \in [7.5, 10]$: values closer to the center of the distribution are highlighted:
$w_{ED}'(n_{VA}) \sim \mathcal{N}(\mu_{VA}^{w}, \alpha\,\Sigma_{VA}^{w})$, $\alpha \in [0.5, 1]$,   (14)
where $\alpha$ is inversely proportional to $\psi_w$.

The qualifiers more and less apply only to query-by-similarity and are modeled in the score computation phase; their modeling is described later in this paper. In order to obtain a unique distribution for the ED space and for the similarity query space, we compute the joint distributions as the product of the individual distributions:

$w_{ED}(n_{VA}) = \prod_{w_{ED}' \in Z_{ED}} w_{ED}'(n_{VA})$,   (15)

$s_{ED}(n_{VA}) = \prod_{s \in Z_S} s_{ED}(n_{VA})$,   (16)

$s_{NED}(n_{VA}) = \prod_{s \in Z_S} s_{NED}(n_{VA})$,   (17)

where $Z_{ED}$ is the set of EDs in the query and $Z_S$ is the set of songs retrieved by the query-by-similarity. The use of the product of normal distributions guarantees that the final result is still a normal distribution. Since NEDs are modeled independently, it is not necessary to produce a unique distribution for them.
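Purely for illustration, the NED qualifier rules of Eqs. (8)-(10) could be sketched as below. The linear mappings of $\psi_w$ to $\alpha$ and the merging of the two middle intervals into a single branch are assumptions; the 0.2 standard deviation and the 0.1 rescaling follow the text above.

```python
# Sketch (assumption): altering a 1D NED concept distribution according to the
# verbal qualifier value psi in [0, 10], following the cases of Eqs. (8)-(10).
from scipy.stats import norm

def apply_ned_qualifier(mu, sigma, psi):
    """Return a frozen scipy normal distribution for the qualified concept."""
    if psi < 2.5:                                  # Eq. (8): flip to the opposite descriptor
        alpha = 0.5 + 0.5 * (psi / 2.5)            # assumed linear mapping to [0.5, 1]
        return norm(1.0 - mu, alpha * sigma)
    elif psi < 7.5:                                # Eq. (9): center on the rescaled qualifier
        return norm(0.1 * psi, 0.2)
    else:                                          # Eq. (10): concentrate around the bound
        alpha = 1.0 - 0.5 * ((psi - 7.5) / 2.5)    # assumed linear mapping to [1, 0.5]
        return norm(mu, alpha * sigma)

# "slightly groovy" (psi ~ 2.5) vs "very groovy" (psi ~ 7.9), hypothetical values
slightly = apply_ned_qualifier(mu=1.0, sigma=0.5, psi=2.5)
very = apply_ned_qualifier(mu=1.0, sigma=0.5, psi=7.9)
print(slightly.mean(), very.mean())
```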
D. Overall score computation

An overall score for each song is needed to produce the resulting ranked list; it is the conjunction of the text-based query score and the query-by-similarity score. The text-based query score is proportional to the probability of matching the query and is computed as the posterior probability given the ED, NED and similarity models.

ED and NED scores for a song $s$ are computed as

$\xi_{ED}^{s} = w_{ED}(\mu_{VA}^{s})\,P(s_{ED})$,   (18)

$\xi_{NED}^{s} = \left[\prod_{d \in Z_{NED}} w_d(\mu_d^{s})\,P(s_d)\right]^{1/|Z_{NED}|}$,   (19)

where $P(s_{ED})$ and $P(s_d)$ are the a-priori probabilities of $s$, $Z_{NED}$ is the set of NEDs in the query, and $|Z_{NED}|$ is its cardinality.
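A minimal sketch (an assumption, not the authors' code) of how the partial scores of Eqs. (18)-(19) could be computed for one song, given the concept distributions built from the query; the combination with the similarity score via the geometric mean of Eq. (21) follows below.

```python
# Sketch (assumption): partial text-query scores of Eqs. (18)-(19) for one song.
# `query_ed` is the joint ED distribution of the query (Eq. (15)),
# `query_neds` maps each NED in the query to its (possibly qualified) 1D distribution;
# both are assumed to expose a .pdf() method (e.g. frozen scipy distributions).
import numpy as np

def ed_score(query_ed, song, prior_ed):
    """Eq. (18): joint ED query density evaluated at the song's VA mean."""
    return query_ed.pdf(song["mu_va"]) * prior_ed

def ned_score(query_neds, song, priors_ned):
    """Eq. (19): geometric mean over the NEDs mentioned in the query."""
    factors = [query_neds[d].pdf(song["mu_ned"][d]) * priors_ned[d]
               for d in query_neds]
    return np.prod(factors) ** (1.0 / len(factors))
```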

As the strength of the annotation consensus is related to the variance, $P(s_{ED})$ and $P(s_d)$ are inversely proportional to $\Sigma_{VA}^{s}$ and $\sigma_d^{s}$. In order to cushion the impact of the annotation consensus on the posterior probability, $P(s_{ED})$ and $P(s_d)$ are normalized in $[0.8, 1]$. As no consensus is available for the Tempo descriptor, its a-priori probability is set to 1.

The query-by-similarity score for a song $s$ is computed as the product of the song-similarity ED and NED scores:

$\xi_{S}^{s} = \xi_{ED,S}^{s}\,\xi_{NED,S}^{s}$,   (20)

where $\xi_{ED,S}^{s}$ and $\xi_{NED,S}^{s}$ are computed similarly to Eqs. (18) and (19), using $s$ instead of $w$. The qualifiers more and less are applied only to query-by-similarity and represent a constraint on the set of songs to consider. Given the $d$-th NED descriptor or the ED descriptor $w_{ED}$ that the qualifier is applied to, scores for the qualifier more are computed only for songs with $\mu_d^{s} > \mu_d^{\hat{s}}$ or $w_{ED}(\mu_{VA}^{s}) > w_{ED}(\mu_{VA}^{\hat{s}})$, for each $\hat{s} \in Z_S$. The qualifier less is the dual of more. In order to make the scores from the different spaces comparable, the overall score for $s$ is computed as the geometric mean of the partial scores:

$\xi^{s} = \sqrt[3]{\xi_{S}^{s}\,\xi_{NED}^{s}\,\xi_{ED}^{s}}$.   (21)

III. IMPLEMENTATION

The ANEW [9] dataset of affective words includes over 2000 terms. However, many of them are not strictly related to mood, but rather to application contexts (e.g., school or city). We therefore filtered the ANEW terms using the WordNet-Affect lexical database [14], a subset of the WordNet database [15], to retain only the concepts strictly related to mood.

One of the main issues in building a semantic music search engine is collecting a representative set of songs annotated with high-level descriptors. In [16], the authors collected a set of 240 excerpts of 15 seconds each, annotated in the VA plane at every second. We averaged the annotations related to each song (over seconds and over all testers) to produce the mean and the standard deviation. We expanded the dataset by adding annotations for NEDs through an online listening test. Five excerpts from among the 240 songs were randomly proposed, and testers were asked to rate each of the descriptors on a 9-point Likert scale, except for Tempo, which is not manually annotated. Ratings in the range [1, ..., 4] assign a graded prevalence to the first concept, ratings in the range [6, ..., 9] to the second, and a rating of 5 asserts no preference. 166 people completed the test.

In order to clean the set of annotations from possible outliers, we applied the Modified Z-score (MZ-score) algorithm [17]. Excerpts that collected fewer than three ratings for each descriptor after the MZ-score outlier analysis were discarded, leaving annotations for 130 songs. For each Non-Emotional Descriptor, the mean and standard deviation were finally computed. In order to provide NED annotations for the remaining 110 songs in the dataset, an automatic annotation system was applied: we used the 130 annotated excerpts to train a set of linear regressors and a set of robust linear regressors (one for each descriptor) [18]. The linear regressor exhibited the best performance, hence we used it for the annotation, and we take the root mean-square error as the standard deviation of the annotation. As for the extraction of Tempo information from the songs, we used a VAMP plugin for the Sonic Annotator (http://www.omras2.org/sonicannotator), based on [19]; we then manually corrected the wrongly estimated tempos.
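As a side illustration of the outlier-cleaning step, the Modified Z-score can be computed from the median and the median absolute deviation; the 3.5 cutoff below is a commonly used threshold and is an assumption here, since the paper does not state the value used.

```python
# Sketch (assumption): Modified Z-score outlier filtering of listener ratings.
# The 0.6745 constant is the standard MAD scaling; the 3.5 threshold is an
# assumption (a commonly used value), not taken from the paper.
import numpy as np

def remove_outliers_mz(ratings, threshold=3.5):
    """Keep ratings whose Modified Z-score magnitude is below the threshold."""
    ratings = np.asarray(ratings, dtype=float)
    median = np.median(ratings)
    mad = np.median(np.abs(ratings - median))     # median absolute deviation
    if mad == 0:
        return ratings                            # all ratings (nearly) identical
    mz = 0.6745 * (ratings - median) / mad
    return ratings[np.abs(mz) < threshold]

print(remove_outliers_mz([4, 5, 5, 6, 5, 9]))     # the 9 is likely dropped
```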
We used the Stanford Natural Language Processing parser [20] to analyze the query into a semantic tree. The Stanford parser is based on a Probabilistic Context-Free Grammar (PCFG). Qualifiers are also checked in their -er form. The parser is also used in query-by-similarity to identify titles and authors. In order to make this step robust against typos, we use the Jaccard similarity metric [17] to compare the authors and titles found in the query with those in the database. The system also performs a synonym analysis using the Natural Language Toolkit (NLTK) [21] to be robust to terms that are missing from the ED and NED concept spaces. The ranked list of retrieved songs is presented in playlist form.

IV. EXPERIMENTAL RESULTS AND EVALUATIONS

The method proposed in this study is based on a semantic description of songs built on a large set of concepts and qualifiers. Unfortunately, no ground truth is available and, due to the complexity of the model, producing one is a hard process that goes beyond the scope of this contribution. For this reason, the system has been evaluated through a subjective test: 53 tests were collected in two phases. In the first phase, testers were asked to rate the quality of the ranked lists produced for five predefined queries. In the second phase, testers were asked to evaluate the general performance through free use of the system. Evaluations were rated on a 9-point Likert scale. Testers were categorized according to their musical knowledge; since no substantial differences emerged between categories, merged results are presented in this section.

A. Pre-defined queries

We chose representative queries aimed at testing all the functionalities of the system. A summary of the evaluations is listed in Table IV. In general, subjects gave a positive evaluation: 4 of the 5 tests reached a mode value of 8 with a small standard deviation.

TABLE IV
EVALUATION FOR THE PREDEFINED QUERIES
  Query                                                                      Mode  Mean  Std
  I want a very groovy and happy song                                          7   6.96  1.16
  I want a not happy at all, dull and flowing song                             8   7.26  1.32
  I want a playlist that sounds angry, fast and rough                          8   7.58  1.10
  I would like to listen to calm, flowing and slow songs like Orinoco Flow     8   7.53  1.45
  I want a playlist not angry, not stuttering and with a slow tempo            8   7.66  1.18

B. General evaluation

An overview of the evaluation concerning the general performance of the system is shown in Table V. The mode of the ratings is 7, agreed upon by 46.67% of testers, while only 19% of the testers gave the experiments an evaluation below 5. The idea of a music search engine based on semantic natural language queries was widely appreciated: 32% of testers considered the system useful and assigned it a top mark of 9. 79% stated that they would use this kind of system and, in particular, 26% assigned it a top rating. The standard deviation of 2.02 is explained by a certain reluctance to use and learn new tools; this conclusion is based on the numerous comments spontaneously left by testers.

TABLE V
EVALUATION OF THE GENERAL ASPECTS OF THE SYSTEM
  Question                                                                           Mode  Mean  Std
  Please indicate the general evaluation of the results obtained with free queries     7   6.32  1.38
  Do you think this system is useful?                                                  9   7.49  1.42
  Would you ever use this kind of system?                                              9   7.00  2.02
  How do you evaluate the system in general?                                           7   7.15  1.08

Finally, subjects were asked to provide a global evaluation of the system concerning the results, the idea, the functionalities, the usefulness and the potential. 90% evaluated this work and its potential positively; 7 is the mode, agreed upon by 42% of testers. Subjects seemed positively impressed by this type of system. A histogram of the collected evaluations is shown in Fig. 5.

Fig. 5. Histogram of evaluation rates for the general concept of the system (x-axis: rates from 1 to 9; y-axis: number of occurrences).

V. CONCLUSIONS

We proposed a music search engine based on textual natural language queries using emotional descriptors, non-emotional descriptors and semantic song similarity. We used a dimensional approach to term conceptualization in order to provide a degree of intensity in the music description; this allows us to use qualifiers to alter the semantics of the related concepts. The adopted parsing solution relies on Natural Language Processing techniques. Concepts and songs are modeled as probability distributions in the emotional and non-emotional spaces. The ranked list of songs is obtained by computing the final score of each song as a posterior probability, based on Bayesian Decision Theory. We finally collected subjective evaluations of a prototype of the system; the subjective tests returned good performance evaluations, with promising results for future developments.

REFERENCES

[1] H. H. Kim, "A semantically enhanced tag-based music recommendation using emotion ontology," in Proceedings of the 5th Asian Conference on Intelligent Information and Database Systems - Volume Part II, 2013, pp. 119-128.
[2] A. L.-C. Wang, "An industrial-strength audio search algorithm," in Proceedings of the 4th International Conference on Music Information Retrieval, 2003.
[3] M. Slaney, "Semantic-audio retrieval," in 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 4, 2002.
[4] D. Turnbull, L. Barrington, D. Torres, and G. Lanckriet, "Towards musical query-by-semantic-description using the CAL500 data set," in SIGIR 2007 Proceedings, Session 18: Music Retrieval, 2007.
[5] R. Dale, H. L. Moisl, and H. L. Somers, Handbook of Natural Language Processing. CRC Press, 2000.
[6] J. A. Russell, "A circumplex model of affect," Journal of Personality and Social Psychology, vol. 39, no. 6, pp. 1161-1178, 1980.
[7] L. Barrington, R. Oda, and G. Lanckriet, "Smarter than Genius? Human evaluation of music recommender systems," in Proc. International Symposium on Music Information Retrieval, 2009.
[8] L. Chiarandini, M. Zanoni, and A. Sarti, "A system for dynamic playlist generation driven by multimodal control signals and descriptors," in Multimedia Signal Processing (MMSP), 2011 IEEE 13th International Workshop on, 2011.
[9] M. M. Bradley and P. J. Lang, "Affective norms for English words (ANEW): Instruction manual and affective ratings," NIMH Center for the Study of Emotion and Attention, Tech. Rep., 1999.
[10] M. Lesaffre, L. D. Voogdt, M. Leman, B. D. Baets, H. D. Meyer, and J. P. Martens, "How potential users of music search and retrieval systems describe the semantic quality of music," Journal of the American Society for Information Science and Technology, vol. 59, no. 5, pp. 697-707, 2008.
[11] L. M. Zbikowski, "Modelling the groove: Conceptual structure and popular music," Journal of the Royal Musical Association, vol. 129, no. 2, pp. 272-297, 2004.
[12] J. Cu, R. Cabredo, R. Legaspi, and M. Suarez, "On modelling emotional responses to rhythm features," in PRICAI 2012: Trends in Artificial Intelligence, Lecture Notes in Computer Science, vol. 7458, pp. 857-860, 2012.
[13] B. Rohrmann, "Verbal qualifiers for rating scales: Sociolinguistic considerations and psychometric data," Project Report, University of Melbourne, Australia, Tech. Rep., 2003.
[14] C. Strapparava and A. Valitutti, "WordNet-Affect: an affective extension of WordNet," in Proceedings of LREC, vol. 4, 2004, pp. 1083-1086.
[15] G. A. Miller, "WordNet: A lexical database for English," Communications of the ACM, vol. 38, no. 11, pp. 39-41, 1995.
[16] Y. E. Kim, E. Schmidt, and L. Emelle, "MoodSwings: A collaborative game for music mood label collection," in Proceedings of the International Symposium on Music Information Retrieval, 2008, pp. 231-236.
[17] P.-N. Tan, Introduction to Data Mining, 2nd ed. Addison-Wesley, 2013.
[18] E. M. Schmidt, D. Turnbull, and Y. E. Kim, "Feature selection for content-based, time-varying musical emotion regression," in Proc. ACM SIGMM International Conference on Multimedia Information Retrieval, Philadelphia, PA, 2010.
[19] M. E. Davies and M. D. Plumbley, "Context-dependent beat tracking of musical audio," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 3, pp. 1009-1020, 2007.
[20] D. Klein and C. D. Manning, "Accurate unlexicalized parsing," in Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics - Volume 1, 2003, pp. 423-430.
[21] E. Loper and S. Bird, "NLTK: The Natural Language Toolkit," in Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1, 2002, pp. 63-70.