A HMM-based Mandarin Chinese Singing Voice Synthesis System

Xian Li and Zengfu Wang

Abstract: We propose a Mandarin Chinese singing voice synthesis system based on the hidden Markov model (HMM) speech synthesis technique. A Mandarin Chinese singing voice corpus is recorded, and musical contextual features are carefully designed for training. The F0 and spectrum of the singing voice are modeled simultaneously with context-dependent HMMs. A new problem arises: the F0 of singing voice is sparse because of the large number of contexts (tempo and pitch of the note, key, time signature, etc.), so features that hardly ever appear in the training data cannot be modeled well. To address this problem, the difference between the F0 of the singing voice and that of the musical score (DF0) is modeled by a single Viterbi training pass. To overcome the over-smoothing of the generated F0 contour, a syllable-level F0 model based on the discrete cosine transform (DCT) is applied, and the F0 contour is generated by integrating the two-level statistical models. Experimental results demonstrate that the proposed system outperforms the baseline system in both objective and subjective evaluations. The proposed system can generate a more natural F0 contour. Furthermore, the syllable-level F0 model makes the singing voice more expressive.

Index Terms: Singing voice synthesis, melisma, discrete cosine transform (DCT).

I. INTRODUCTION

Singing voice is probably the first musical instrument, as it existed prior to the invention of any other instrument. Singing voice synthesis, which enables a computer to sing like a human, has been a subject of research for more than 50 years [1], and the quality that can now be obtained opens new perspectives. There are currently two main types of singing voice synthesis systems. One is concatenative singing voice synthesis (CSVS) [2-3]; a typical CSVS application is the Vocaloid synthesizer [4]. CSVS concatenates sample units from a singing voice corpus and succeeds in capturing the naturalness of the sound; however, its lack of flexibility and its requirement for a very large corpus are two of its main problems. The other is hidden Markov model based singing voice synthesis (HMMSVS). In recent years, the HMM-based speech synthesis technique [5-6] has developed rapidly and has been applied to various applications, including singing voice synthesis. The main advantage of HMMSVS is its flexibility in changing voice characteristics; it also has a very small footprint. The first HMMSVS system is Sinsy [7-10], which currently supports Japanese and English. For Mandarin Chinese, there has been earlier work. Zhou et al. [11] built a Mandarin Chinese CSVS system. Li et al. [12] implemented an F0 and spectrum modification module as the back-end of a text-to-speech system to synthesize singing voice. Gu et al. [13] used the harmonic plus noise model (HNM) to design a scheme for synthesizing Mandarin Chinese singing voice. Recently, a Mandarin Chinese HMMSVS was proposed [14].

Manuscript received December 2, 2014; accepted October 25, 2015. Recommended by Associate Editor Liang Wang. Citation: Xian Li, Zengfu Wang. A HMM-based Mandarin Chinese singing voice synthesis system. IEEE/CAA Journal of Automatica Sinica, 2016, 3(2): 192-202. Xian Li is with the Department of Automation, University of Science and Technology of China (shysian@mail.ustc.edu.cn). Zengfu Wang is with the Institute of Intelligent Machines, Chinese Academy of Sciences (zfwang@ustc.edu.cn).
We focus on Mandarin Chinese HMMSVS in this paper. Although a Mandarin Chinese HMMSVS already exists [14], our system has the following distinguishing features: 1) we solve the data sparseness problem and handle melisma at the same time inside the HMM-based framework; 2) a recent advance in speech synthesis, the multi-level F0 model [15-17], is used to overcome the over-smoothing of the generated F0 contour. The major difference between read speech and singing voice is that singing should obey the pitch and rhythm of the musical score, and pitch and rhythm also have a great impact on the subjective quality of the synthesized singing voice. In our system, precise rhythm is guaranteed by modeling the time-lag with a timing model [7]. Two methods are used to improve F0 generation. One is single Viterbi training: after conventional model training, Viterbi alignment is performed to obtain a state-level alignment, and a single training pass is performed to model the difference between the F0 of the singing voice and that of the musical score. This method not only solves the data sparseness problem but also handles melisma. The other is two-level F0 generation: the F0 of the musical score is subtracted from the F0 contour, and the residual F0 contour is parameterized by the discrete cosine transform (DCT) at the syllable level. Context-dependent HMMs (CD-HMMs) are then trained for the syllable DCT coefficients. At generation time, the state-level DF0 model and the syllable-level DCT model are integrated to generate the F0 contour. The two-level method can overcome over-smoothing and generate a more expressive F0 contour. Besides, vibrato is also extracted and modeled [9] by CD-HMMs in our system. Objective and subjective evaluations are conducted to evaluate the performance of the proposed system. For the objective evaluation, we define a new measurement called mean note distance (MND) based on the sense of music; MND is calculated by measuring the distance of each note pair in two F0 contours, so it can measure whether the two note sequences are in tune.

The rest of this paper is organized as follows: related previous work is presented in Section II, the proposed system is described in Section III, Section IV describes the proposed improved F0 models, Section V presents experimental results, Section VI discusses the relation to previous work, and conclusions are given in Section VII.

II. PREVIOUS WORK

A. HMM-based Framework

Fig. 1 shows the flowchart of an HMM-based speech synthesis system.

Fig. 1. Flowchart of the HMM-based speech synthesis system.

In Mandarin Chinese, each syllable consists of one initial and one final, or only one final, and the initial/final is usually used as the model unit in speech synthesis. Each initial/final is modeled by a left-to-right, no-skip HMM. At the feature extraction stage, F0 and spectral features are extracted from the waveforms. In addition to the static features, velocity and acceleration features are appended. Let X_t = [x_t^T, Δx_t^T, Δ²x_t^T]^T, with Δx_t = 0.5(x_{t+1} - x_{t-1}) and Δ²x_t = x_{t-1} - 2x_t + x_{t+1}. If x = [x_1^T, x_2^T, ..., x_T^T]^T is the static feature sequence and X = [X_1^T, X_2^T, ..., X_T^T]^T is the observation feature sequence, then X can be written as X = Wx, where W is a matrix determined by the way the velocity and acceleration features are calculated. In addition to the acoustic features, contextual features are also extracted for training.

During training, F0 and spectrum are modeled with multi-stream HMMs, in which the F0 stream is modeled with multi-space probability distribution HMMs [18]. A set of context-dependent HMMs λ is estimated by maximizing the likelihood function, and a decision tree is then built to cluster all HMM states based on contextual features using the minimum description length (MDL) criterion [19]. The state duration model is also trained at this stage.

During synthesis, a label consisting of the designed contextual features is obtained by linguistic analysis of the input text, and the state duration model is used to determine the state sequence. F0 and spectral parameters are then generated by the maximum likelihood parameter generation (MLPG) algorithm [20] using the state-level F0 and spectrum models:

x̂ = arg max_x P(Wx | λ, l) ≈ arg max_x P(Wx | q, λ) P(q | λ, l),  (1)

where l is the input label and q = [q_1, q_2, ..., q_T] is the state sequence determined by l and the duration model. By setting the derivative of log P(Wx | q, λ) with respect to x to zero, the speech parameters are generated by solving the linear equation (2). Because W^T U^{-1} W has a positive-definite band-symmetric structure, this equation can be solved efficiently by Cholesky decomposition:

W^T U^{-1} W x̂ = W^T U^{-1} M,  (2)

U = diag{Σ_{q_1}, ..., Σ_{q_T}},  (3)

M = [µ_{q_1}^T, ..., µ_{q_T}^T]^T,  (4)

where T is the number of frames, and µ_{q_t} and Σ_{q_t} are the mean vector and covariance matrix of the q_t-th state, respectively. Finally, the mel log spectrum approximation (MLSA) filter [21] is used to synthesize the waveform from the generated parameters.

B. HMM-based Singing Voice Synthesis

The first HMM-based singing voice synthesis system is Sinsy [7-10], which currently supports Japanese and English. Sinsy made several modifications to the HMM-based speech synthesis framework for the purpose of singing voice synthesis. First, the duration model used in speech synthesis is no longer appropriate, because singing should obey the rhythm of the musical score; however, if durations strictly obeyed the musical score, the synthetic singing voice would sound unnatural, because there is a time-lag between the note start times on the musical score and those of real singing. In Sinsy, this time-lag was modeled by a Gaussian distribution.
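To make the generation step concrete, the following is a minimal numerical sketch of solving (2) for a single one-dimensional feature stream, assuming the window coefficients given above and diagonal covariances. It is not the authors' implementation; practical systems exploit the band structure of W^T U^{-1} W instead of forming dense matrices.

```python
import numpy as np

def mlpg_1d(means, variances):
    """Solve (2) for one 1-D stream. means/variances: (T, 3) arrays holding the
    per-frame Gaussian statistics of [static, velocity, acceleration]."""
    T = means.shape[0]
    # Build W so that W @ x stacks [x_t, delta x_t, delta^2 x_t] for every frame,
    # using delta x_t = 0.5 (x_{t+1} - x_{t-1}) and delta^2 x_t = x_{t-1} - 2 x_t + x_{t+1}.
    W = np.zeros((3 * T, T))
    for t in range(T):
        lo, hi = max(t - 1, 0), min(t + 1, T - 1)
        W[3 * t, t] = 1.0                                  # static
        W[3 * t + 1, hi] += 0.5                            # velocity
        W[3 * t + 1, lo] -= 0.5
        W[3 * t + 2, lo] += 1.0                            # acceleration
        W[3 * t + 2, t] -= 2.0
        W[3 * t + 2, hi] += 1.0
    U_inv = np.diag(1.0 / variances.reshape(-1))           # diagonal U^{-1}
    M = means.reshape(-1)
    A = W.T @ U_inv @ W                                    # positive-definite, band-symmetric
    b = W.T @ U_inv @ M
    L = np.linalg.cholesky(A)                              # A = L L^T
    return np.linalg.solve(L.T, np.linalg.solve(L, b))     # generated static trajectory
```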
Second, the F0 data of singing voice is always sparse because of the large number of contexts, e.g., key and pitch of the note, so HMMs for contexts that hardly ever appear in the training data cannot be well trained. Pitch-shifted pseudo training [8] has been proposed to address this problem, but it has several drawbacks; for example, features specific to a particular pitch range cannot be modeled, since the pitch contexts are mixed by the pitch-shifted pseudo-data [10]. Because pseudo training is not a good solution, other methods that model the difference between the F0 of the singing voice and that of the musical score have been proposed [10, 22]. The data-level normalization method [22] models this difference by normalizing the data before training. Although F0 itself is sparse, it stays around the F0 value of the musical note, so the difference is not sparse and the data sparseness problem can be solved. The problem with data-level normalization is that the model must be trained with the labels fixed, otherwise there is an inconsistency between the data and the training; but with fixed labels, the models cannot be well trained unless the labels are perfectly correct, and in most cases the labels contain errors. Oura et al. [10] proposed pitch adaptive training, which performs F0 normalization at the model level using the speaker adaptive training technique [23]:

µ_i = B_i ξ_i = µ̂_i + b_i,  (5)

where b_i is the F0 value of the note corresponding to the i-th state, the transformation matrix B_i = [1, b_i] is fixed by the musical score, ξ_i = [µ̂_i, 1]^T, and µ̂_i is the only parameter that needs to be estimated. In this way, normalization of F0 is done at the model level, and there is no need to fix the labels during training.

Third, Sinsy also added a vibrato stream to synthesize vibrato [9]. In the HMM-based framework, the synthesized F0 contour is smooth, so vibrato cannot be synthesized directly. In Sinsy, the F0 of the singing voice was represented as the sum of two components: an intonation component m(t), which corresponds to the melody of the musical score, and a vibrato component v(t), which is an almost sinusoidal modulation. As shown in (6), v_e(t) is the vibrato extent, v_r(t) is the vibrato rate, and f_s is the sampling frequency of the F0 contour:

F0(t) = m(t) + v(t) = m(t) + v_e(t) cos(2π v_r(t) t / f_s).  (6)

Vibrato regions were detected and vibrato parameters were extracted [24], and the vibrato parameters were then modeled by context-dependent HMMs in the same way as F0 and spectrum.

Recently, an HMM-based Mandarin singing voice synthesis system was proposed [14]. In this system, pitch-shifted pseudo training was used to address the data sparseness problem of F0, and vibrato was synthesized at a constant vibrato rate within each note.

C. Problems

In addition to the data sparseness problem, two further problems need to be solved for F0. In Mandarin Chinese singing, a syllable may contain one note or more than one note. Singing a single syllable while moving between several different notes in succession is called melisma. Melisma is not discussed in the pitch adaptive training approach [10]. In Cheng's approach [14], melisma is synthesized by repeating the final of the preceding word to mimic the singing skill; however, this approach, which is usually used in CSVS, suffers from discontinuity.

One drawback of HMM-based synthesis is that the generated parameter trajectory is usually over-smoothed. This is a more serious problem for singing voice, because the F0 of singing voice is more dynamic [25]. A higher-level F0 model based on DCT, combined with the state-level model [15-16], has been used to overcome the over-smoothing of the generated F0 contour in speech synthesis. We apply a syllable-level DCT model to singing voice synthesis in this paper.

III. DESCRIPTION OF THE PROPOSED SYSTEM

A. Database Creation

1) Database design and recording. The songs for the singing database were selected from children's songs and traditional pop songs. Four criteria were used for selection: a) coverage of all syllables; b) balance of initials/finals; c) coverage and balance of keys; d) coverage and balance of tempos. After selecting the songs, MIDI files were recorded through a MIDI keyboard connected to a computer. The MIDI files and lyrics were then combined and saved in the format of LilyPond [26], a professional music notation environment and software. Fig. 2 shows an example of the musical score format.

Fig. 2. An example of the musical score format.

A professional male singer was invited to record the database. All the songs were recorded along with a metronome. An overview of the database is summarized in Table I.

TABLE I
OVERVIEW OF THE DATABASE
Item             Details
Singer           Professional male singer, 25 years old
Number of songs  105
Total time       132 minutes
Sample rate      48 kHz
Resolution       16 bits

2) Contextual features.
The synthetic singing voice should convey both precise lyric information and musical information, so both should be included in the contextual features. We design four levels of contextual features for our system:
a) initial/final level: i) current initial/final; ii) preceding and succeeding initial/final.
b) note level (or syllable level): i) absolute pitch; ii) relative pitch; iii) pitch difference between the current note and the preceding note;

iv) pitch difference between the current note and the succeeding note; v) length of the preceding, current and succeeding notes (in thirty-second notes and in milliseconds); vi) position of the current note in the musical bar; vii) position of the current note in the musical phrase; viii) melisma or not, absolute pitch of the last note in the melisma, and the pitch difference between the first and last notes of the melisma.
c) phrase level: i) number of notes in the current phrase; ii) length (in thirty-second notes); iii) length (in milliseconds).
d) song level: i) key, time signature and tempo; ii) number of notes; iii) length (in milliseconds); iv) length (in thirty-second notes).

For each musical score (lyrics included), we analyze it to obtain the full contextual features.

B. System Framework

The framework of the proposed singing voice synthesis system is shown in Fig. 3. The modules drawn with solid lines represent the baseline system; they are similar to the system described in Section II-A, except for an additional timing model and an additional vibrato stream. The modules drawn with dashed lines are the improved F0 models of the proposed system; their details are described in Section IV.

Fig. 3. Flowchart of the HMM-based singing voice synthesis system (the modules in solid lines represent the baseline system; the modules in dashed lines are the improved F0 models of the proposed system; CD-HMMs means context-dependent HMMs).

The initial/final is used as the model unit and is modeled by an HMM with a left-to-right, no-skip structure. Vibrato is modeled simultaneously with F0 and spectrum using multi-stream HMMs. After conventional training, syllable-level and state-level alignments are obtained by aligning the training data with the trained models. The state alignment is then used to perform a single Viterbi training pass that models the difference between the F0 of the musical score and that of the singing voice. The syllable alignment is used to train the timing model [7] for precise rhythm, and the syllable-level DCT F0 model is also trained. During synthesis, a label consisting of contextual features is obtained by analyzing the musical score. The timing model combined with the state duration model is then used to determine the HMM state sequence [7], and the F0 and spectral parameters are generated. Details of the proposed F0 generation methods are described in Section IV.

IV. IMPROVED F0 MODEL

A. DF0 Model

In the case of melisma, a model unit (initial/final) may range over several different notes, so the transformation matrix in (5) cannot be fixed directly by the musical score; pitch adaptive training [10] is therefore not suitable for the proposed system. Our method can handle melisma. As shown in Fig. 4, the difference between the F0 of the singing voice and that of the musical score is modeled. After the conventional model training, the following steps are performed.

Fig. 4. Single Viterbi training: the state sequence is aligned, and DF0 is then obtained and trained for each corresponding state. In the case of melisma, a state may range over more than one note.

1) Viterbi alignment is performed to obtain the state sequence.
2) The F0 vector of the musical score is generated according to the state sequence and the musical score. The F0 value of each note is calculated by

f = 440 x 2^((MIDINOTE - 69)/12).

The duration of each note is determined by the state sequence. In the case of melisma, the duration of each note in the melisma is calculated by

d_i = (L_i / Σ_j L_j) d,

where d is the duration of the whole melisma determined by the state sequence, d_i is the calculated duration of the i-th note, and L_i is the musical length (e.g., 1/4, 1/8) of the i-th note.
3) The difference between the F0 of the singing voice and that of the musical score (DF0) is obtained frame by frame.
4) DF0 is modeled in one of two ways.
a) DF0-A: since the state-level alignment is available, DF0 shares the decision trees with F0 and is modeled by a Gaussian at each leaf node.
b) DF0-B: DF0 is modeled separately from the other parameters within the HMM-based framework; it has its own decision tree and duration model.

During generation, the state sequence q is first determined, and the F0 vector of the musical score M_S is then generated according to the musical score and the state sequence. The vector M in (2) becomes the sum of two parts, M_DF0 and M_S:

M_F0 = M_DF0 + M_S.  (7)

Finally, the MLPG algorithm is performed to generate the F0 contour. This method is a special form of speaker adaptive training: it performs a single Viterbi training pass, and the transformation matrix is fixed by the musical score and the aligned state-level label simultaneously.

B. Syllable-level Model

1) Discrete cosine transform for F0. DCT has been used successfully for modeling F0 in several languages [15-17], and it has also been used for characterizing the F0 of singing voice [27-28]. The DCT is a linear, invertible transform that expresses a finite F0 contour as a sum of cosine functions. The Type-II DCT is used in this paper:

c_n = sqrt(2/T) α_n Σ_{t=0}^{T-1} f_t cos(π n (t + 1/2) / T),  n = 0, 1, ..., N-1,  (8)

α_n = 1/sqrt(2) for n = 0, and α_n = 1 for n = 1, ..., N-1,  (9)

where f_0, ..., f_{T-1} is a finite F0 contour of length T, represented by N coefficients c_0, ..., c_{N-1}. Similarly, the inverse DCT (IDCT) is defined as

f_t = sqrt(2/T) Σ_{n=0}^{N-1} α_n c_n cos(π n (t + 1/2) / T),  t = 0, 1, ..., T-1.  (10)

The first DCT coefficient corresponds to the mean of the F0 contour, and the rest are the weights of cosine functions with increasing frequencies. The DCT has a strong energy-compaction property: most of the signal information tends to be concentrated in a few low-frequency DCT components. In previous research, 5 DCT coefficients were used to represent a syllable F0 contour of neutral speech [16], and 8 DCT coefficients were used for emotional Mandarin speech [17]. Considering that a syllable in singing is usually longer than in speech and also more dynamic [25], more coefficients may be needed for singing voice. The number of DCT coefficients is discussed in Section V-D.

2) DCT-based syllable-level F0 model. At the syllable level, sparseness of F0 is also a problem, so the DCT is performed on the residual F0 contour. Fig. 5 shows the procedure: the F0 of the musical score is first subtracted from the F0 contour, and the residual F0 contour is then parameterized by DCT at the syllable level. Fig. 6 shows basic shapes of the syllable F0 contour.
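As a concrete illustration of (8)-(10) and the residual parameterization shown in Fig. 5, here is a small sketch (assuming the orthonormal DCT-II convention written above; not the authors' code) that reduces one syllable's residual logF0 contour to N coefficients and reconstructs it:

```python
import numpy as np

def dct_coeffs(residual_f0, n_coeffs=10):
    """Type-II DCT (8)-(9) of a syllable-level residual logF0 contour
    (score F0 already subtracted), keeping the first n_coeffs coefficients."""
    T = len(residual_f0)
    t = np.arange(T)
    alpha = np.ones(n_coeffs)
    alpha[0] = 1.0 / np.sqrt(2.0)
    return np.array([np.sqrt(2.0 / T) * alpha[n] *
                     np.sum(residual_f0 * np.cos(np.pi * n * (t + 0.5) / T))
                     for n in range(n_coeffs)])

def idct_contour(c, T):
    """Inverse transform (10): rebuild a length-T residual contour from N coefficients."""
    n = np.arange(len(c))
    alpha = np.ones(len(c))
    alpha[0] = 1.0 / np.sqrt(2.0)
    return np.array([np.sqrt(2.0 / T) *
                     np.sum(alpha * c * np.cos(np.pi * n * (tt + 0.5) / T))
                     for tt in range(T)])

# Example: a 120-frame syllable whose sung logF0 deviates smoothly from the score note.
score_logf0 = np.full(120, np.log(440.0))                        # hypothetical constant note
sung_logf0 = score_logf0 + 0.05 * np.sin(np.linspace(0, np.pi, 120))
c = dct_coeffs(sung_logf0 - score_logf0, n_coeffs=10)            # syllable DCT coefficients
approx = score_logf0 + idct_contour(c, 120)                      # reconstructed contour
```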
Fig. 5. Procedure of performing DCT on the syllable-level F0 contour.

During training, context-dependent HMMs λ_s are trained for the syllable DCT coefficients; the same contextual factors as in the state-level models are used for decision-tree clustering.

C. Generation Integrating Two Levels

During generation, the DF0 model and the syllable-level DCT model are integrated to generate the F0 contour [16]:

f̂ = arg max_f P(Wf | λ) P(D(f - f_score) | λ_s)^α,  (11)

where f is the static F0 sequence, f_score is the F0 value of the note sequence on the musical score, D is the matrix that computes the DCT coefficients from the F0 contour (its structure is given in (15)), and α is the weight of the syllable-level model. Solving (11), the MLPG algorithm combining the two levels can be written as

(W^T U^{-1} W + α D^T U_s^{-1} D) f̂ = W^T U^{-1} M_F0 + α D^T U_s^{-1} M_s,  (12)

U_s = diag{Σ_{s,q_1}, ..., Σ_{s,q_S}},  (13)

M_s = [µ_{s,q_1}^T, ..., µ_{s,q_S}^T]^T + D f_score,  (14)

where D is block-diagonal over the S syllables of the utterance; for a syllable of T frames, the corresponding block has elements

D_{n,t} = sqrt(2/T) α_n cos(π n (t + 1/2) / T),  n = 0, ..., N-1,  t = 0, ..., T-1,  (15)

S is the number of syllables, and µ_{s,q_s} and Σ_{s,q_s} are the mean vector and covariance matrix of the q_s-th syllable model, respectively. An informal listening test was conducted to decide the value of α; finally, α was set to 3.0T/(NS), which is 3.0 times the ratio of the number of dimensions of M to that of M_s.
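A minimal sketch of how (12) could be assembled and solved, assuming the state-level terms (W, U^{-1}, M_F0) and the syllable-level terms (U_s^{-1}, M_s) have already been built; the helper names are hypothetical and this is not the authors' implementation. The per-syllable blocks of (15) are stacked into D with scipy.linalg.block_diag.

```python
import numpy as np
from scipy.linalg import block_diag

def dct_block(T, N):
    """N x T block of D in (15) for one syllable of T frames."""
    n = np.arange(N)[:, None]
    t = np.arange(T)[None, :]
    alpha = np.ones((N, 1))
    alpha[0, 0] = 1.0 / np.sqrt(2.0)
    return np.sqrt(2.0 / T) * alpha * np.cos(np.pi * n * (t + 0.5) / T)

def generate_f0_two_levels(W, U_inv, M_f0, Us_inv, M_s, syllable_lengths, N, alpha):
    """Solve (12): combine state-level logF0 statistics (W, U_inv, M_f0 as in (2)-(4))
    with syllable-level DCT statistics (Us_inv, M_s as in (13)-(14)); alpha weights
    the syllable level. Returns the generated static logF0 contour f_hat."""
    D = block_diag(*[dct_block(T_s, N) for T_s in syllable_lengths])   # eq. (15)
    A = W.T @ U_inv @ W + alpha * (D.T @ Us_inv @ D)
    b = W.T @ U_inv @ M_f0 + alpha * (D.T @ Us_inv @ M_s)
    return np.linalg.solve(A, b)
```

Here M_s is assumed to already include the D f_score term of (14), so the recovered f̂ is the full logF0 contour rather than the residual.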

V. EXPERIMENTS

A. Experimental Conditions

The corpus presented in Section III-A was used in our experiments. The database was randomly divided into a training set and a testing set; the total length of the training set was 115 minutes, and the total length of the testing set was 17 minutes. 34th-order MGCs [29] (35 parameters including the energy component) were extracted from the speech signals by spectral analysis with a 25 ms Hamming window shifted every 5 ms. F0 was extracted by the get_f0 method in Snack [30] without manual correction, and logF0 was used as the F0 feature. Vibrato was extracted [24] and the vibrato component was subtracted from the F0 contour. After appending the dynamic features, the logF0, vibrato and MGC features consisted of static, velocity, and acceleration components. A 7-state, left-to-right, multi-stream hidden semi-Markov model (HSMM) [31] was used. The MGC stream was modeled with a single multivariate Gaussian distribution. The F0 stream and the vibrato stream were modeled with multi-space probability distribution HSMMs [18]: a Gaussian distribution for voiced/vibrato frames and a discrete distribution for unvoiced/non-vibrato frames. The HTS toolkit [32] was used to build the systems.

A comparison of the F0 generation methods of the baseline system and the proposed systems is summarized in Table II. Vibrato was modeled but not used at F0 generation in systems Baseline, DF0-A, DF0-B and DCT; the difference between DF0-A and DF0-B was presented in Section IV-A, and system VIB is system DCT with the vibrato component added. In all the systems, the vibrato component was subtracted from the F0 contour, and the intonation component was used for the F0 stream.
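Since every system subtracts the vibrato component before training and system VIB adds it back at synthesis, a small sketch of the sinusoidal vibrato model of (6) is given below; the per-frame extent and rate tracks are assumed to come from the vibrato detector, and the variable names are hypothetical.

```python
import numpy as np

def vibrato_component(extent, rate, frame_rate):
    """v(t) = v_e(t) * cos(2 pi v_r(t) t / f_s), as in (6).
    extent: per-frame vibrato extent (in the units of the F0 stream),
    rate: per-frame vibrato rate in Hz, frame_rate: F0 frames per second (f_s)."""
    t = np.arange(len(extent))
    return extent * np.cos(2.0 * np.pi * rate * t / frame_rate)

# Removing vibrato before training, and re-adding it in system VIB:
#   intonation = f0 - vibrato_component(extent, rate, 200.0)   # 5 ms shift -> 200 frames/s
#   f0_vib     = intonation + vibrato_component(gen_extent, gen_rate, 200.0)
```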

Fig. 6. Basic shapes of the syllable F0 contour: K-means is performed on the DCT coefficients, and the F0 contour is then reconstructed from the DCT coefficients of the centroid of each cluster; C is the number of clusters.

TABLE II
THE SYSTEMS USED IN THE EXPERIMENTS
Name of system   Description of method
Baseline         MLPG (Equation (2))
DF0-A            Equation (7) + MLPG
DF0-B            Same as DF0-A (see Section IV-A)
DCT              Two-level integration (Equation (12))
VIB              DCT + vibrato

We used flat-start, uniformly segmented labels to initially train the baseline model, which was then used to perform forced alignment on the training data to obtain the syllable and state alignments. After that, the timing model [7] was trained for precise rhythm. As described in Section IV, the DF0 model was trained with the help of the state alignment, and the DCT-parameterized syllable F0 model was trained as a 1-state HMM with the help of the syllable alignment. It should be noted that the DCT was performed on the intonation component of F0. At the synthesis stage, the state durations were determined according to the timing model and the state duration model, and they were kept the same for all the systems in the evaluation. The MLSA filter was used as the synthesis filter.

B. Model Structure

The minimum description length (MDL) criterion [19] was used as the stopping criterion for decision-tree growing, with the MDL factor set to 1. Table III shows the numbers of leaf nodes in the decision trees of the state-level spectrum, F0 and vibrato models and the syllable-level DCT model. The numbers of leaf nodes of F0 and MGC are similar to those reported in Cheng's work [14], and the numbers of leaf nodes of F0 and DF0 are very close. The numbers of leaf nodes of the state-level models are larger than those reported for Sinsy [9], because more training data was used in the proposed system and the language is also different. There had been no syllable-level DCT model for F0 in singing synthesis before; the number of its leaf nodes is slightly smaller than that in Mandarin Chinese speech synthesis [16], which is reasonable because the residual F0 contour is modeled here.

TABLE III
NUMBER OF LEAF NODES IN THE SYSTEMS
Feature                   Number of leaf nodes
F0
DF0 (in system DF0-B)     3997
MGC                       3139
Vibrato                   1678
Syllable DCT              101

C. Metric

In addition to the root mean square error (RMSE) and the correlation of F0, which are commonly used in speech synthesis, we use another measurement based on the sense of music. In music theory, an interval is the difference between two pitches, and the standard unit for comparing interval sizes is the cent. Given the frequencies of two notes a and b, the size in cents of the interval from a to b is calculated by the following formula (similar to the definition of the decibel):

cent(b, a) = 1200 log_2(f_b / f_a),  (16)

where f_b and f_a are frequencies in Hz. We define a measurement called mean note distance (MND) based on cents:

MND [cent] = (1/Q) Σ_{q=1}^{Q} ND(q),  (17)

ND(q) [cent] = (1/I_q) Σ_{i=1}^{I_q} cent(b_i, a_i),  (18)

where Q is the number of notes, ND(q) is the distance between the q-th notes of the two F0 contours, I_q is the number of frames in the q-th note, and b_i and a_i are the values (in Hz) of the i-th frame of the q-th note. MND is calculated note by note; it measures whether the two note sequences are in tune, and thus measures the distance between two F0 contours in a musical sense.

D. Number of DCT Coefficients

To determine the number of DCT coefficients, the intonation component was reconstructed from the DCT coefficients and compared with the original intonation component. Fig. 7 shows the RMSE, correlation and MND between the reconstructed and original F0 contours. Because vibrato detection is not perfect, some vibrato remains in the intonation component; this residual vibrato may make these measurements, e.g., MND, slightly higher. When the number of DCT coefficients reaches 10, the RMSE is rather small, the correlation is 0.99, and the MND is nearly 5 cents, i.e., a twentieth of a semitone, which is very hard for a human to distinguish. We therefore set the number of DCT coefficients to 10 in the rest of the experiments.

Fig. 7. (a) RMSE, (b) correlation, and (c) MND as a function of the number of DCT coefficients.

E. Subjective Evaluations

Subjective listening tests were conducted to evaluate the naturalness of the synthetic singing voice. Ten segments randomly selected from the test set were used for the evaluation; their average length was 30 seconds. Examples of the test segments are available at shysian/singingsynthesis/index.html. Two subjective listening tests were conducted with a web-based interface.

The first listening test evaluated the naturalness of the systems. Eight subjects were asked to rate the naturalness of the synthetic singing voice as a mean opinion score (MOS) on a scale from 1 (poor) to 5 (good). All the segments were presented in random order. Fig. 8 shows that the proposed systems significantly outperform the baseline system. This is because whether the synthetic F0 contour is in tune has a great impact on the subjective quality: the proposed systems can generate an in-tune F0 contour, while the baseline system cannot because of the data sparseness problem. The DF0-A method is better than the DF0-B method, perhaps because DF0-A generates DF0 from the same state sequence as the other parameters (spectrum, vibrato), while DF0-B generates DF0 from a different state sequence.
Accordingly, in all of our experiments DF0-A was used as the state-level model for the DCT method. The DCT method is better than the DF0-A method; according to our observations, the DCT method generates a more expressive F0 contour. We also see that the addition of vibrato slightly increases the naturalness. In our system, vibrato is generated at the frame level and its relation to the note is ignored; although vibrato can increase expressiveness, the generated vibrato is sometimes slightly unnatural, which is why the subjects did not give system VIB a significantly higher score than system DCT.

The second listening test was an AB preference test comparing the expressiveness of the proposed systems. Eight subjects were asked to select from three preference

choices: 1) the former is better (more expressive); 2) the latter is better; 3) no preference. All the segments were presented in random order. Fig. 9 shows that the syllable DCT model generated a more expressive F0 contour than the DF0-A method, and that the addition of vibrato further improved the expressiveness.

Fig. 10 shows the synthesized F0 contour of a melisma: a single syllable "a" ranging over three different notes (MIDI notes 62, 60, 59). The solid line indicates the F0 contour of the musical score, and the broken lines indicate the F0 contours generated by the proposed methods and the baseline method. Both proposed methods generated better F0 contours than the baseline. The F0 value was converted to a MIDI note number by

MIDINOTE = 69 + 12 log_2(F0 / 440 Hz).  (19)

Fig. 8. Mean opinion scores of the baseline and the proposed systems, with confidence intervals.

Fig. 9. Results of the AB preference test for expressiveness evaluation; NOP means no preference.

Fig. 10. Synthesized F0 contour of a melisma: a single syllable "a" ranging over three different notes (MIDI notes 62, 60, 59); F0 values were converted to MIDI note numbers. The solid line indicates the F0 contour of the musical score, and the broken lines indicate the F0 contours generated by the proposed methods and the baseline method.

F. Effectiveness of the Improved F0 Model

Table IV shows a comparison of the objective measurements calculated on the test set for the baseline system and the proposed systems. These measurements were calculated between the original F0 contours (intonation component) in the test set and the generated F0 contours (intonation component). The results show that the proposed methods outperform the baseline method on all three measurements.

TABLE IV
COMPARISON OF OBJECTIVE MEASUREMENTS
System     RMSE of logF0    MND (cent)    Correlation
Baseline
DF0-A
DF0-B
DCT

Fig. 11 shows a comparison of the F0 contours generated by the baseline system and the proposed systems; the F0 values were converted to MIDI note numbers. The solid line indicates the F0 contour of the musical score, and the broken lines indicate the F0 contours generated by the proposed methods and the baseline method. The two F0 contours generated by the proposed methods (DF0-A and syllable DCT) are in tune with the musical score, while the one generated by the baseline method is out of tune at some notes. Both the DF0 and the syllable DCT method alleviate the data sparseness problem and generate an in-tune F0 contour. The syllable DCT method also generated a more expressive F0 contour than the state-level DF0; in particular, the preparation and overshoot of F0 observed during note transitions [25] were successfully synthesized by the syllable DCT method. Fig. 12 shows an example of a generated F0 contour with and without the vibrato component; system VIB successfully generated a time-variant vibrato, which looks like an almost sinusoidal modulation.

As can be seen from Figs. 10-12, the generated F0 is a little lower than the musical score; this is because the singer had a tendency to sing slightly flat, and the same phenomenon has been observed for Sinsy [7]. Table V shows the variance of the F0 generated by systems DF0-A and DCT: the variance is larger for system DCT than for system DF0-A, so over-smoothing is partly compensated by integrating the statistical models of the two levels.
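For reference, the evaluation quantities can be computed as in the following sketch (hypothetical helper names, not the authors' evaluation scripts): the cent interval of (16), the MND of (17)-(18) (absolute cent distances are assumed here), and the Hz-to-MIDI conversion of (19) used for plotting Figs. 10 and 11.

```python
import numpy as np

def cents(f_b, f_a):
    """Interval from frequency f_a to f_b in cents, eq. (16)."""
    return 1200.0 * np.log2(f_b / f_a)

def mean_note_distance(gen_f0, ref_f0, note_bounds):
    """MND of (17)-(18). gen_f0/ref_f0: frame-aligned voiced F0 values in Hz;
    note_bounds: list of (start_frame, end_frame) pairs, one per note."""
    per_note = [np.mean(np.abs(cents(gen_f0[s:e], ref_f0[s:e])))   # ND(q), eq. (18)
                for s, e in note_bounds]
    return float(np.mean(per_note))                                # MND, eq. (17)

def hz_to_midi(f0_hz):
    """Eq. (19): F0 in Hz to a (fractional) MIDI note number."""
    return 69.0 + 12.0 * np.log2(f0_hz / 440.0)
```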

Fig. 11. F0 contours of the synthetic singing voice compared with that of the musical score; F0 values were converted to MIDI note numbers. The solid line indicates the F0 contour of the musical score, and the broken lines indicate the F0 contours generated by the proposed methods and the baseline method.

Fig. 12. An example of a generated F0 contour with and without the vibrato component; F0 values are in log scale. The solid line indicates the F0 contour of the musical score, and the broken lines indicate the generated F0 contours with and without vibrato.

TABLE V
VARIANCE OF GENERATED F0
System    Variance (confidence interval: 0.95)
DF0-A     ±
DCT       ±

VI. RELATION TO PREVIOUS WORK

Our DF0-A method is a special form of the pitch adaptive training method: it performs only a single Viterbi alignment, and then the difference between F0 and the musical score is modeled. By doing so, it can handle melisma. It also benefits from not performing pitch normalization before training, so that DF0 and the other parameters (spectrum, vibrato, etc.) can be estimated and generated with the same state sequence (synchronous states). The DF0-B method is very similar to the data-level pitch normalization method; the difference is that the DF0 data are obtained through a more accurate alignment. In the DF0-B method, the DF0 model parameters are estimated, and the DF0 parameters are generated, from state sequences different from those of the other acoustic parameters (asynchronous states).

VII. CONCLUSION

We have proposed an HMM-based Mandarin Chinese singing voice synthesis system. A Mandarin Chinese singing voice corpus was recorded, and musical contextual features were carefully designed for training. We solve the data sparseness problem and handle melisma at the same time inside the HMM-based framework. A discrete cosine transform (DCT) F0 model is also applied, and the two-level statistical models are integrated at generation time to overcome the over-smoothing of the generated F0 contour. Objective and subjective evaluations showed that our system can generate a natural and "in tune" F0 contour. Furthermore, the method integrating the two-level statistical models successfully made the singing voice more expressive.

As the experimental results show, modeling vibrato can improve the naturalness and expressiveness of the singing voice. In this paper, vibrato was generated at the state level, and its relation to the note was not explicitly considered; future work will focus on modeling vibrato with its relation to the note considered explicitly. The quality of the synthetic voice is also expected to improve if a better spectrum extraction method is used, so another piece of future work is substituting a better spectrum estimator for the conventional mel-cepstral analysis.

REFERENCES

[1] Cook P R. Singing voice synthesis: history, current work, and future directions. Computer Music Journal, 1996, 20(3).
[2] Bonada J, Serra X. Synthesis of the singing voice by performance sampling and spectral models. IEEE Signal Processing Magazine, 2007, 24(2).
[3] Bonada J. Voice Processing and Synthesis by Performance Sampling and Spectral Models [Ph.D. dissertation], Universitat Pompeu Fabra, Barcelona, 2008.
[4] Kenmochi H, Ohshita H. VOCALOID - commercial singing synthesizer based on sample concatenation. In: Proceedings of the 8th Annual Conference of the International Speech Communication Association. Antwerp, Belgium.
[5] Ling Z H, Wu Y J, Wang Y P, Qin L, Wang R H. USTC system for Blizzard Challenge 2006 - an improved HMM-based speech synthesis method. In: Blizzard Challenge Workshop. Pittsburgh, USA, 2006.
[6] Zen H G, Tokuda K, Black A W. Statistical parametric speech synthesis. Speech Communication, 2009, 51(11).
[7] Saino K, Zen H G, Nankaku Y, Lee A, Tokuda K. An HMM-based singing voice synthesis system. In: Proceedings of the 9th International Conference on Spoken Language Processing. Pittsburgh, PA, USA, 2006.
[8] Mase A, Oura K, Nankaku Y, Tokuda K. HMM-based singing voice synthesis system using pitch-shifted pseudo training data. In: Proceedings of the 11th Annual Conference of the International Speech Communication Association. Makuhari, Chiba, Japan.
[9] Oura K, Mase A, Yamada T, Muto S, Nankaku Y, Tokuda K. Recent development of the HMM-based singing voice synthesis system - Sinsy. In: Proceedings of the 2010 ICASSP. Kyoto, Japan.
[10] Oura K, Mase A, Nankaku Y, Tokuda K. Pitch adaptive training for HMM-based singing voice synthesis. In: Proceedings of the 2012 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). Kyoto: IEEE, 2012.
[11] Zhou S S, Chen Q C, Wang D D, Yang X H. A corpus-based concatenative Mandarin singing voice synthesis system. In: Proceedings of the 2008 International Conference on Machine Learning and Cybernetics. Kunming, China: IEEE, 2008.
[12] Li J L, Yang H W, Zhang W Z, Cai L H. A lyrics to singing voice synthesis system with variable timbre. In: Proceedings of the 2011 International Conference on Applied Informatics and Communication. Xi'an, China: Springer, 2011.
[13] Gu H Y, Liau H L. Mandarin singing voice synthesis using an HNM based scheme. In: Proceedings of the 2008 Congress on Image and Signal Processing. Sanya, China: IEEE, 2008.
[14] Cheng J Y, Huang Y C, Wu C H. HMM-based Mandarin singing voice synthesis using tailored synthesis units and question sets. Computational Linguistics and Chinese Language Processing, 2013, 18(4).
[15] Latorre J, Akamine M. Multilevel parametric-base F0 model for speech synthesis. In: Proceedings of the 9th Annual Conference of the International Speech Communication Association. Brisbane, Australia.
[16] Qian Y, Wu Z Z, Gao B Y, Soong F K. Improved prosody generation by maximizing joint probability of state and longer units. IEEE Transactions on Audio, Speech, and Language Processing, 2011, 19(6).
[17] Li X, Yu J, Wang Z F. Prosody conversion for Mandarin emotional voice conversion. Acta Acustica, 2014, 39(4). (in Chinese)
[18] Tokuda K, Masuko T, Miyazaki N, Kobayashi T. Hidden Markov models based on multi-space probability distribution for pitch pattern modeling. In: Proceedings of the 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Phoenix, AZ: IEEE, 1999.
[19] Shinoda K, Watanabe T. MDL-based context-dependent subword modeling for speech recognition. The Journal of the Acoustical Society of Japan (E), 2000, 21(2).
[20] Tokuda K, Yoshimura T, Masuko T, Kobayashi T, Kitamura T. Speech parameter generation algorithms for HMM-based speech synthesis. In: Proceedings of the 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing. Istanbul: IEEE, 2000.
[21] Imai S, Sumita K, Furuichi C. Mel log spectrum approximation (MLSA) filter for speech synthesis. Electronics and Communications in Japan (Part I: Communications), 1983, 66(2).
[22] Saino K, Tachibana M, Kenmochi H. An HMM-based singing style modeling system for singing voice synthesizers. In: Proceedings of the 7th ISCA Workshop on Speech Synthesis, 2010.
[23] Yamagishi J, Kobayashi T. Average-voice-based speech synthesis using HSMM-based speaker adaptation and adaptive training. IEICE Transactions on Information and Systems, 2007, E90-D(2).
[24] Nakano T, Goto M. An automatic singing skill evaluation method for unknown melodies using pitch interval accuracy and vibrato features. In: Proceedings of the 9th International Conference on Spoken Language Processing. Pittsburgh, PA, USA, 2006.
[25] Saitou T, Unoki M, Akagi M. Development of an F0 control model based on F0 dynamic characteristics for singing-voice synthesis. Speech Communication, 2005, 46(3-4).
[26] LilyPond [Online].
[27] Devaney J C, Mandel M I, Fujinaga I. Characterizing singing voice fundamental frequency trajectories. In: Proceedings of the 2011 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics. New Paltz, NY: IEEE, 2011.
[28] Lee S W, Dong M H, Li H Z. A study of F0 modelling and generation with lyrics and shape characterization for singing voice synthesis. In: Proceedings of the 8th International Symposium on Chinese Spoken Language Processing. Kowloon: IEEE.
[29] Koishida K, Tokuda K, Kobayashi T, Imai S. CELP coding based on mel-cepstral analysis. In: Proceedings of the 1995 International Conference on Acoustics, Speech, and Signal Processing. Detroit, MI: IEEE, 1995.
[30] Snack [Online].
[31] Zen H G, Tokuda K, Masuko T, Kobayashi T, Kitamura T. Hidden semi-Markov model based speech synthesis. In: Proceedings of the 8th International Conference on Spoken Language Processing. Jeju Island, Korea.
[32] HMM-based speech synthesis system (HTS) [Online], available: hts.sp.nitech.ac.jp.

Xian Li is a Ph.D. candidate in the Department of Automation, University of Science and Technology of China. His research interests include singing and speech synthesis.

Zengfu Wang is a professor at the Institute of Intelligent Machines, Chinese Academy of Sciences. His research interests include computer vision, human-computer interaction, and intelligent robots. He is the corresponding author of this paper.


More information

Week 14 Query-by-Humming and Music Fingerprinting. Roger B. Dannenberg Professor of Computer Science, Art and Music Carnegie Mellon University

Week 14 Query-by-Humming and Music Fingerprinting. Roger B. Dannenberg Professor of Computer Science, Art and Music Carnegie Mellon University Week 14 Query-by-Humming and Music Fingerprinting Roger B. Dannenberg Professor of Computer Science, Art and Music Overview n Melody-Based Retrieval n Audio-Score Alignment n Music Fingerprinting 2 Metadata-based

More information

On human capability and acoustic cues for discriminating singing and speaking voices

On human capability and acoustic cues for discriminating singing and speaking voices Alma Mater Studiorum University of Bologna, August 22-26 2006 On human capability and acoustic cues for discriminating singing and speaking voices Yasunori Ohishi Graduate School of Information Science,

More information

ON FINDING MELODIC LINES IN AUDIO RECORDINGS. Matija Marolt

ON FINDING MELODIC LINES IN AUDIO RECORDINGS. Matija Marolt ON FINDING MELODIC LINES IN AUDIO RECORDINGS Matija Marolt Faculty of Computer and Information Science University of Ljubljana, Slovenia matija.marolt@fri.uni-lj.si ABSTRACT The paper presents our approach

More information

Automatic Singing Performance Evaluation Using Accompanied Vocals as Reference Bases *

Automatic Singing Performance Evaluation Using Accompanied Vocals as Reference Bases * JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 31, 821-838 (2015) Automatic Singing Performance Evaluation Using Accompanied Vocals as Reference Bases * Department of Electronic Engineering National Taipei

More information

Automatic Laughter Detection

Automatic Laughter Detection Automatic Laughter Detection Mary Knox Final Project (EECS 94) knoxm@eecs.berkeley.edu December 1, 006 1 Introduction Laughter is a powerful cue in communication. It communicates to listeners the emotional

More information

Audio-Based Video Editing with Two-Channel Microphone

Audio-Based Video Editing with Two-Channel Microphone Audio-Based Video Editing with Two-Channel Microphone Tetsuya Takiguchi Organization of Advanced Science and Technology Kobe University, Japan takigu@kobe-u.ac.jp Yasuo Ariki Organization of Advanced Science

More information

An Efficient Low Bit-Rate Video-Coding Algorithm Focusing on Moving Regions

An Efficient Low Bit-Rate Video-Coding Algorithm Focusing on Moving Regions 1128 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 11, NO. 10, OCTOBER 2001 An Efficient Low Bit-Rate Video-Coding Algorithm Focusing on Moving Regions Kwok-Wai Wong, Kin-Man Lam,

More information

SINGING EXPRESSION TRANSFER FROM ONE VOICE TO ANOTHER FOR A GIVEN SONG. Sangeon Yong, Juhan Nam

SINGING EXPRESSION TRANSFER FROM ONE VOICE TO ANOTHER FOR A GIVEN SONG. Sangeon Yong, Juhan Nam SINGING EXPRESSION TRANSFER FROM ONE VOICE TO ANOTHER FOR A GIVEN SONG Sangeon Yong, Juhan Nam Graduate School of Culture Technology, KAIST {koragon2, juhannam}@kaist.ac.kr ABSTRACT We present a vocal

More information

VOCALISTENER: A SINGING-TO-SINGING SYNTHESIS SYSTEM BASED ON ITERATIVE PARAMETER ESTIMATION

VOCALISTENER: A SINGING-TO-SINGING SYNTHESIS SYSTEM BASED ON ITERATIVE PARAMETER ESTIMATION VOCALISTENER: A SINGING-TO-SINGING SYNTHESIS SYSTEM BASED ON ITERATIVE PARAMETER ESTIMATION Tomoyasu Nakano Masataka Goto National Institute of Advanced Industrial Science and Technology (AIST), Japan

More information

MUSICAL INSTRUMENT IDENTIFICATION BASED ON HARMONIC TEMPORAL TIMBRE FEATURES

MUSICAL INSTRUMENT IDENTIFICATION BASED ON HARMONIC TEMPORAL TIMBRE FEATURES MUSICAL INSTRUMENT IDENTIFICATION BASED ON HARMONIC TEMPORAL TIMBRE FEATURES Jun Wu, Yu Kitano, Stanislaw Andrzej Raczynski, Shigeki Miyabe, Takuya Nishimoto, Nobutaka Ono and Shigeki Sagayama The Graduate

More information

WHAT MAKES FOR A HIT POP SONG? WHAT MAKES FOR A POP SONG?

WHAT MAKES FOR A HIT POP SONG? WHAT MAKES FOR A POP SONG? WHAT MAKES FOR A HIT POP SONG? WHAT MAKES FOR A POP SONG? NICHOLAS BORG AND GEORGE HOKKANEN Abstract. The possibility of a hit song prediction algorithm is both academically interesting and industry motivated.

More information

Pitch Analysis of Ukulele

Pitch Analysis of Ukulele American Journal of Applied Sciences 9 (8): 1219-1224, 2012 ISSN 1546-9239 2012 Science Publications Pitch Analysis of Ukulele 1, 2 Suphattharachai Chomphan 1 Department of Electrical Engineering, Faculty

More information

Automatic Construction of Synthetic Musical Instruments and Performers

Automatic Construction of Synthetic Musical Instruments and Performers Ph.D. Thesis Proposal Automatic Construction of Synthetic Musical Instruments and Performers Ning Hu Carnegie Mellon University Thesis Committee Roger B. Dannenberg, Chair Michael S. Lewicki Richard M.

More information

Music Similarity and Cover Song Identification: The Case of Jazz

Music Similarity and Cover Song Identification: The Case of Jazz Music Similarity and Cover Song Identification: The Case of Jazz Simon Dixon and Peter Foster s.e.dixon@qmul.ac.uk Centre for Digital Music School of Electronic Engineering and Computer Science Queen Mary

More information

Skip Length and Inter-Starvation Distance as a Combined Metric to Assess the Quality of Transmitted Video

Skip Length and Inter-Starvation Distance as a Combined Metric to Assess the Quality of Transmitted Video Skip Length and Inter-Starvation Distance as a Combined Metric to Assess the Quality of Transmitted Video Mohamed Hassan, Taha Landolsi, Husameldin Mukhtar, and Tamer Shanableh College of Engineering American

More information

Automatic Laughter Detection

Automatic Laughter Detection Automatic Laughter Detection Mary Knox 1803707 knoxm@eecs.berkeley.edu December 1, 006 Abstract We built a system to automatically detect laughter from acoustic features of audio. To implement the system,

More information

POST-PROCESSING FIDDLE : A REAL-TIME MULTI-PITCH TRACKING TECHNIQUE USING HARMONIC PARTIAL SUBTRACTION FOR USE WITHIN LIVE PERFORMANCE SYSTEMS

POST-PROCESSING FIDDLE : A REAL-TIME MULTI-PITCH TRACKING TECHNIQUE USING HARMONIC PARTIAL SUBTRACTION FOR USE WITHIN LIVE PERFORMANCE SYSTEMS POST-PROCESSING FIDDLE : A REAL-TIME MULTI-PITCH TRACKING TECHNIQUE USING HARMONIC PARTIAL SUBTRACTION FOR USE WITHIN LIVE PERFORMANCE SYSTEMS Andrew N. Robertson, Mark D. Plumbley Centre for Digital Music

More information

A probabilistic framework for audio-based tonal key and chord recognition

A probabilistic framework for audio-based tonal key and chord recognition A probabilistic framework for audio-based tonal key and chord recognition Benoit Catteau 1, Jean-Pierre Martens 1, and Marc Leman 2 1 ELIS - Electronics & Information Systems, Ghent University, Gent (Belgium)

More information

SINGING PITCH EXTRACTION BY VOICE VIBRATO/TREMOLO ESTIMATION AND INSTRUMENT PARTIAL DELETION

SINGING PITCH EXTRACTION BY VOICE VIBRATO/TREMOLO ESTIMATION AND INSTRUMENT PARTIAL DELETION th International Society for Music Information Retrieval Conference (ISMIR ) SINGING PITCH EXTRACTION BY VOICE VIBRATO/TREMOLO ESTIMATION AND INSTRUMENT PARTIAL DELETION Chao-Ling Hsu Jyh-Shing Roger Jang

More information

Automatic Piano Music Transcription

Automatic Piano Music Transcription Automatic Piano Music Transcription Jianyu Fan Qiuhan Wang Xin Li Jianyu.Fan.Gr@dartmouth.edu Qiuhan.Wang.Gr@dartmouth.edu Xi.Li.Gr@dartmouth.edu 1. Introduction Writing down the score while listening

More information

Figure 1: Feature Vector Sequence Generator block diagram.

Figure 1: Feature Vector Sequence Generator block diagram. 1 Introduction Figure 1: Feature Vector Sequence Generator block diagram. We propose designing a simple isolated word speech recognition system in Verilog. Our design is naturally divided into two modules.

More information

Automatic characterization of ornamentation from bassoon recordings for expressive synthesis

Automatic characterization of ornamentation from bassoon recordings for expressive synthesis Automatic characterization of ornamentation from bassoon recordings for expressive synthesis Montserrat Puiggròs, Emilia Gómez, Rafael Ramírez, Xavier Serra Music technology Group Universitat Pompeu Fabra

More information

A New Method for Calculating Music Similarity

A New Method for Calculating Music Similarity A New Method for Calculating Music Similarity Eric Battenberg and Vijay Ullal December 12, 2006 Abstract We introduce a new technique for calculating the perceived similarity of two songs based on their

More information

Automatic Commercial Monitoring for TV Broadcasting Using Audio Fingerprinting

Automatic Commercial Monitoring for TV Broadcasting Using Audio Fingerprinting Automatic Commercial Monitoring for TV Broadcasting Using Audio Fingerprinting Dalwon Jang 1, Seungjae Lee 2, Jun Seok Lee 2, Minho Jin 1, Jin S. Seo 2, Sunil Lee 1 and Chang D. Yoo 1 1 Korea Advanced

More information

BayesianBand: Jam Session System based on Mutual Prediction by User and System

BayesianBand: Jam Session System based on Mutual Prediction by User and System BayesianBand: Jam Session System based on Mutual Prediction by User and System Tetsuro Kitahara 12, Naoyuki Totani 1, Ryosuke Tokuami 1, and Haruhiro Katayose 12 1 School of Science and Technology, Kwansei

More information

APPLICATIONS OF A SEMI-AUTOMATIC MELODY EXTRACTION INTERFACE FOR INDIAN MUSIC

APPLICATIONS OF A SEMI-AUTOMATIC MELODY EXTRACTION INTERFACE FOR INDIAN MUSIC APPLICATIONS OF A SEMI-AUTOMATIC MELODY EXTRACTION INTERFACE FOR INDIAN MUSIC Vishweshwara Rao, Sachin Pant, Madhumita Bhaskar and Preeti Rao Department of Electrical Engineering, IIT Bombay {vishu, sachinp,

More information

Recognising Cello Performers using Timbre Models

Recognising Cello Performers using Timbre Models Recognising Cello Performers using Timbre Models Chudy, Magdalena; Dixon, Simon For additional information about this publication click this link. http://qmro.qmul.ac.uk/jspui/handle/123456789/5013 Information

More information

A PERPLEXITY BASED COVER SONG MATCHING SYSTEM FOR SHORT LENGTH QUERIES

A PERPLEXITY BASED COVER SONG MATCHING SYSTEM FOR SHORT LENGTH QUERIES 12th International Society for Music Information Retrieval Conference (ISMIR 2011) A PERPLEXITY BASED COVER SONG MATCHING SYSTEM FOR SHORT LENGTH QUERIES Erdem Unal 1 Elaine Chew 2 Panayiotis Georgiou

More information

Statistical Modeling and Retrieval of Polyphonic Music

Statistical Modeling and Retrieval of Polyphonic Music Statistical Modeling and Retrieval of Polyphonic Music Erdem Unal Panayiotis G. Georgiou and Shrikanth S. Narayanan Speech Analysis and Interpretation Laboratory University of Southern California Los Angeles,

More information

Melodic Outline Extraction Method for Non-note-level Melody Editing

Melodic Outline Extraction Method for Non-note-level Melody Editing Melodic Outline Extraction Method for Non-note-level Melody Editing Yuichi Tsuchiya Nihon University tsuchiya@kthrlab.jp Tetsuro Kitahara Nihon University kitahara@kthrlab.jp ABSTRACT In this paper, we

More information

Research Article. ISSN (Print) *Corresponding author Shireen Fathima

Research Article. ISSN (Print) *Corresponding author Shireen Fathima Scholars Journal of Engineering and Technology (SJET) Sch. J. Eng. Tech., 2014; 2(4C):613-620 Scholars Academic and Scientific Publisher (An International Publisher for Academic and Scientific Resources)

More information

Computer Coordination With Popular Music: A New Research Agenda 1

Computer Coordination With Popular Music: A New Research Agenda 1 Computer Coordination With Popular Music: A New Research Agenda 1 Roger B. Dannenberg roger.dannenberg@cs.cmu.edu http://www.cs.cmu.edu/~rbd School of Computer Science Carnegie Mellon University Pittsburgh,

More information

CSC475 Music Information Retrieval

CSC475 Music Information Retrieval CSC475 Music Information Retrieval Monophonic pitch extraction George Tzanetakis University of Victoria 2014 G. Tzanetakis 1 / 32 Table of Contents I 1 Motivation and Terminology 2 Psychacoustics 3 F0

More information

DELTA MODULATION AND DPCM CODING OF COLOR SIGNALS

DELTA MODULATION AND DPCM CODING OF COLOR SIGNALS DELTA MODULATION AND DPCM CODING OF COLOR SIGNALS Item Type text; Proceedings Authors Habibi, A. Publisher International Foundation for Telemetering Journal International Telemetering Conference Proceedings

More information

Interacting with a Virtual Conductor

Interacting with a Virtual Conductor Interacting with a Virtual Conductor Pieter Bos, Dennis Reidsma, Zsófia Ruttkay, Anton Nijholt HMI, Dept. of CS, University of Twente, PO Box 217, 7500AE Enschede, The Netherlands anijholt@ewi.utwente.nl

More information

Introductions to Music Information Retrieval

Introductions to Music Information Retrieval Introductions to Music Information Retrieval ECE 272/472 Audio Signal Processing Bochen Li University of Rochester Wish List For music learners/performers While I play the piano, turn the page for me Tell

More information

Measurement of overtone frequencies of a toy piano and perception of its pitch

Measurement of overtone frequencies of a toy piano and perception of its pitch Measurement of overtone frequencies of a toy piano and perception of its pitch PACS: 43.75.Mn ABSTRACT Akira Nishimura Department of Media and Cultural Studies, Tokyo University of Information Sciences,

More information

PROBABILISTIC MODELING OF BOWING GESTURES FOR GESTURE-BASED VIOLIN SOUND SYNTHESIS

PROBABILISTIC MODELING OF BOWING GESTURES FOR GESTURE-BASED VIOLIN SOUND SYNTHESIS PROBABILISTIC MODELING OF BOWING GESTURES FOR GESTURE-BASED VIOLIN SOUND SYNTHESIS Akshaya Thippur 1 Anders Askenfelt 2 Hedvig Kjellström 1 1 Computer Vision and Active Perception Lab, KTH, Stockholm,

More information

Tempo and Beat Analysis

Tempo and Beat Analysis Advanced Course Computer Science Music Processing Summer Term 2010 Meinard Müller, Peter Grosche Saarland University and MPI Informatik meinard@mpi-inf.mpg.de Tempo and Beat Analysis Musical Properties:

More information

OBJECTIVE EVALUATION OF A MELODY EXTRACTOR FOR NORTH INDIAN CLASSICAL VOCAL PERFORMANCES

OBJECTIVE EVALUATION OF A MELODY EXTRACTOR FOR NORTH INDIAN CLASSICAL VOCAL PERFORMANCES OBJECTIVE EVALUATION OF A MELODY EXTRACTOR FOR NORTH INDIAN CLASSICAL VOCAL PERFORMANCES Vishweshwara Rao and Preeti Rao Digital Audio Processing Lab, Electrical Engineering Department, IIT-Bombay, Powai,

More information

A Bayesian Network for Real-Time Musical Accompaniment

A Bayesian Network for Real-Time Musical Accompaniment A Bayesian Network for Real-Time Musical Accompaniment Christopher Raphael Department of Mathematics and Statistics, University of Massachusetts at Amherst, Amherst, MA 01003-4515, raphael~math.umass.edu

More information

TOWARDS EXPRESSIVE INSTRUMENT SYNTHESIS THROUGH SMOOTH FRAME-BY-FRAME RECONSTRUCTION: FROM STRING TO WOODWIND

TOWARDS EXPRESSIVE INSTRUMENT SYNTHESIS THROUGH SMOOTH FRAME-BY-FRAME RECONSTRUCTION: FROM STRING TO WOODWIND TOWARDS EXPRESSIVE INSTRUMENT SYNTHESIS THROUGH SMOOTH FRAME-BY-FRAME RECONSTRUCTION: FROM STRING TO WOODWIND Sanna Wager, Liang Chen, Minje Kim, and Christopher Raphael Indiana University School of Informatics

More information

Analysis of local and global timing and pitch change in ordinary

Analysis of local and global timing and pitch change in ordinary Alma Mater Studiorum University of Bologna, August -6 6 Analysis of local and global timing and pitch change in ordinary melodies Roger Watt Dept. of Psychology, University of Stirling, Scotland r.j.watt@stirling.ac.uk

More information

Improvised Duet Interaction: Learning Improvisation Techniques for Automatic Accompaniment

Improvised Duet Interaction: Learning Improvisation Techniques for Automatic Accompaniment Improvised Duet Interaction: Learning Improvisation Techniques for Automatic Accompaniment Gus G. Xia Dartmouth College Neukom Institute Hanover, NH, USA gxia@dartmouth.edu Roger B. Dannenberg Carnegie

More information

MUSI-6201 Computational Music Analysis

MUSI-6201 Computational Music Analysis MUSI-6201 Computational Music Analysis Part 9.1: Genre Classification alexander lerch November 4, 2015 temporal analysis overview text book Chapter 8: Musical Genre, Similarity, and Mood (pp. 151 155)

More information

Research on sampling of vibration signals based on compressed sensing

Research on sampling of vibration signals based on compressed sensing Research on sampling of vibration signals based on compressed sensing Hongchun Sun 1, Zhiyuan Wang 2, Yong Xu 3 School of Mechanical Engineering and Automation, Northeastern University, Shenyang, China

More information

Classification of Timbre Similarity

Classification of Timbre Similarity Classification of Timbre Similarity Corey Kereliuk McGill University March 15, 2007 1 / 16 1 Definition of Timbre What Timbre is Not What Timbre is A 2-dimensional Timbre Space 2 3 Considerations Common

More information

Augmentation Matrix: A Music System Derived from the Proportions of the Harmonic Series

Augmentation Matrix: A Music System Derived from the Proportions of the Harmonic Series -1- Augmentation Matrix: A Music System Derived from the Proportions of the Harmonic Series JERICA OBLAK, Ph. D. Composer/Music Theorist 1382 1 st Ave. New York, NY 10021 USA Abstract: - The proportional

More information

A prototype system for rule-based expressive modifications of audio recordings

A prototype system for rule-based expressive modifications of audio recordings International Symposium on Performance Science ISBN 0-00-000000-0 / 000-0-00-000000-0 The Author 2007, Published by the AEC All rights reserved A prototype system for rule-based expressive modifications

More information

Computational Models of Music Similarity. Elias Pampalk National Institute for Advanced Industrial Science and Technology (AIST)

Computational Models of Music Similarity. Elias Pampalk National Institute for Advanced Industrial Science and Technology (AIST) Computational Models of Music Similarity 1 Elias Pampalk National Institute for Advanced Industrial Science and Technology (AIST) Abstract The perceived similarity of two pieces of music is multi-dimensional,

More information