Pitch Detection/Tracking Strategy for Musical Recordings of Solo Bowed-String and Wind Instruments


JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 25, 1239-1253 (2009)

Short Paper

Pitch Detection/Tracking Strategy for Musical Recordings of Solo Bowed-String and Wind Instruments

SCREAM Laboratory, Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, 701 Taiwan

A pitch detection/tracking strategy for recordings of solo bowed-string and wind instruments is presented. To avoid the missing-fundamental problem, we adopt the greatest-common-divisor method and modify it with a weighted-and-voting technique that exploits the information carried by strong partials in the target signal. Moreover, a frame-based correction method that takes the performing characteristics of the instruments into account is proposed to correct possible misjudgments in the transition from one note to the next. Experimental results show that the proposed strategy is superior to three popular methods on a pitch extraction/tracking task. The proposed method was also tested on reverberant sound sources, and those results were likewise compared with the other methods.

Keywords: pitch detection, pitch tracking, bowed-string instrument, wind instrument, weighted greatest common divisor and vote (WGCDV)

1. INTRODUCTION

Pitch detection, also referred to as fundamental frequency (F0) estimation, is a classical problem in the audio/speech processing areas. Many methods have been proposed in the literature, and the topic is still actively researched. For example, zero crossings [1, 2], autocorrelation [3, 4], and the harmonic product sum (HPS) [5, 6] are widely used. Systematic reviews of these and other methods can be found in [7-9]. Developing a context-free F0 estimator is a difficult task, whereas context-specific approaches work better in most cases. Identifying the exact pitch at every time instant may not be necessary, because the pitch resolution of human hearing is not very high for most people [10].
Even listeners with perfect pitch cannot identify the exact pitch every time they are asked. If a signal clip is too short, it is almost impossible for listeners to identify its pitch. In fact, many electronic instruments are unable to generate the exact pitch required for each note; they usually deviate 0 to 5 Hz from the standard pitches. Nevertheless, accurate pitch information is still necessary in applications such as structured audio coding and music information retrieval.

Received October 15, 2007; revised February 27 & June 26, 2008; accepted July 25, 2008. Communicated by Chin-Teng Lin.

Since pitch is important for speech recognition/synthesis, a number of pitch detection techniques have been designed for speech data. For example, the Praat tool [11], developed by Boersma and Weenink, aims at analyzing and manipulating digital speech data; its pitch detection mechanism is essentially a mixture of time-domain correlation methods. STRAIGHT [12], proposed by Kawahara et al. and based on the mono vocoder, has produced very good results for voice recognition and synthesis. More recently, robust and accurate F0 estimation has been achieved by the YIN estimator through the interplay between autocorrelation and cancellation [13]. All of these packages contain a good F0 estimation tool. It is, however, not a trivial task to extract a set of pitch information usable for re-synthesizing solo recordings with the above methods [14]. In this paper, we propose a pitch detection/tracking strategy based on the characteristics of audio recordings of instruments such as bowed strings (violin and Erhu), brass (trumpet), and woodwinds (oboe). They are all sustaining-driven musical instruments with unique and constantly changing timbres controlled by professional players. The proposed method is a frequency-domain approach. Frequency-domain approaches not only provide an estimated pitch contour but can also acquire timbre characteristics in the course of the analysis. From the musical analysis and synthesis perspective, pitch detection is not necessarily the first step toward building a synthesis database; instead, a detailed spectral analysis may yield both pitch and timbre parameters, especially when specific instrumental characteristics are considered [14]. Building a practical music synthesis database, however, lies outside the scope of this paper, so we focus on extracting a set of useful pitch information. The basic procedure is illustrated in Fig. 1. The audio samples are first divided into analysis frames.
Then the short-term Fourier transform (STFT) converts each frame into the frequency domain. Based on the assumption that tones of the target instruments are harmonic, a method called weighted greatest common divisor and vote (WGCDV) is employed to find the most likely pitch for each frame. Then, by exploring the relationship among neighboring audio frames according to knowledge of the instruments and their performing characteristics, a post-processing stage called frame-based correction (FBC) corrects possible errors produced by the previous step. Simulation results show that the proposed approach is more suitable for analyzing solo recordings of the target instruments than the previously mentioned tools [11-13].

The rest of the paper is organized as follows. Section 2 introduces the concept of the WGCDV method and gives its detailed steps. FBC is presented in section 3. Computer simulations and case studies are given in section 4, where the performance of the different methods is also compared. Conclusions and future work are given in section 5.

2. WGCDV PITCH DETECTION METHOD

Generally speaking, tones of most sustaining-driven musical instruments, such as the violin and trumpet, hold their pitch longer than those of plucked or struck string instruments, such as the guitar and piano. From this point of view, it seems easier to extract pitch information from such instruments than in the general case. However, some performing techniques, especially for sustaining-driven instruments, introduce obstacles that confuse most pitch detection strategies.
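The framing/STFT front end described above can be sketched in a few lines of NumPy, using the frame parameters reported in section 4 (2,048-sample Hamming-windowed frames with 50% overlap); the function name and interface are illustrative, not part of the original system.

```python
import numpy as np

def stft_frames(x, frame_size=2048, hop=1024):
    """Split a mono signal into Hamming-windowed analysis frames and
    return the magnitude spectrum of each frame (one row per frame)."""
    window = np.hamming(frame_size)
    n_frames = 1 + max(0, (len(x) - frame_size) // hop)
    mags = np.empty((n_frames, frame_size // 2 + 1))
    for i in range(n_frames):
        frame = x[i * hop : i * hop + frame_size] * window
        mags[i] = np.abs(np.fft.rfft(frame))
    return mags
```

For a 1-second 44.1 kHz signal this yields 42 frames of 1,025 magnitude bins, each bin spanning about 21.5 Hz.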

Fig. 1. Proposed pitch detection/tracking method flowchart.

For example, there are no frets on the violin or the Erhu, so players can produce fast trills, vibrato, and portamento by tapping or sliding the fingers on the fingerboard and strings, or by applying greater bowing pressure. All of these are common in bowed-string playing, and in Erhu playing the pitch variation can sometimes exceed an octave. There are other factors that reduce the accuracy of some F0 estimation algorithms. For example, the energy levels of the first two or three partials of the Erhu are often much weaker than those of the higher partials. Based on our observations, such effects greatly bias the estimate. In our experience with different algorithms, a misidentified pitch is usually one octave higher or lower than the actual pitch; in fewer cases it is 7 semitones higher (1.5 times the actual fundamental frequency). If the estimate falls within a half-semitone range, it is usually very close to the actual pitch, as identified in advance by an invited Erhu player.

As shown in Fig. 1, WGCDV estimates F0 in three steps: (a) locate the peaks of the transformed magnitude spectrum; (b) find a likely GCD value for each partial pair using a look-up-table method; (c) weight the likely GCD values according to their spectral energy and determine the final GCD by voting. In the following subsections, we discuss each step in more detail.

2.1 Locate Likely Partial Positions

Since our goal is to extract pitch information from strongly harmonic musical signals, we first need to locate the large peaks as possible partial positions. After a frame of audio data is transformed into the frequency domain, we calculate a smoothed spectrum using a mean filter. In the smoothed spectrum there are three kinds of points: peak points, valley points, and slope points, corresponding to local maxima, local minima, and everything else. Taking Fig. 2 as an example, the protrude value P of peak point A is defined by

P = V_A / max(V_B, V_C), (1)

where V_A is the magnitude of peak point A, and V_B and V_C are the magnitudes of the left valley point B and the right valley point C, respectively.

Fig. 2. Location of a peak A in a smoothed spectrum, where B and C are the left and right valley points.

The protrude value shows how prominent a spectral peak is. To further reduce the number of possible partial positions, a protrude threshold T_P (T_P = 4 is used in section 4) is introduced to reject small peaks. Note that examining the whole spectrum is not necessary, because each target instrument has its own compass. For a target instrument, it is sufficient to analyze from the lowest compass frequency up to two or three octaves above the highest compass frequency, which covers the dominant partials. This principle applies to most of the procedures described afterward.

2.2 GCD Look-up Table Method

For a pitch detection task, the greatest common divisor (GCD) method is closer than time-domain methods to the way humans understand how the pitch of a sound is determined. However, two problems may decrease its effectiveness. First, the GCD is mathematically defined for positive integers and is therefore limited by the frequency resolution of the transform; an excessively short or long window size introduces a larger offset from the possible pitch position. Second, most tones produced by musical instruments are quasi-periodic, so the relation among their

partial components is usually inharmonic. For string instruments, the stiffness of the string causes a dispersion phenomenon [15] that stretches the partial frequencies above the harmonic frequencies. A better solution is to loosen the integer restriction. Without loss of generality, we can extend the GCD concept to the positive real numbers and use a look-up table (LUT) to map a floating-point quotient to its corresponding harmonic relation, by finding the quotient in the harmonic-relation table that is closest to the quotient of the partial pair under examination. An implemented LUT is illustrated in Table 1. With this LUT, the quotient of any two partials can be calculated and matched with the closest entry to determine the most probable harmonic relationship.

Table 1. Greatest common divisor look-up table.

numerator  denominator  numerator/denominator
1   2   0.5000
1   3   0.3333
1   4   0.2500
1   5   0.2000
2   5   0.4000
1   6   0.1667
1   7   0.1429
2   7   0.2857
3   7   0.4286
1   8   0.1250
3   8   0.3750
1   9   0.1111
2   9   0.2222
4   9   0.4444
1   10  0.1000
3   10  0.3000
1   11  0.0909
2   11  0.1818
3   11  0.2727
4   11  0.3636
5   11  0.4545
1   12  0.0833
5   12  0.4167

As noted in section 2.1, peak A in Fig. 2 is not accurate enough to be used directly as a partial point. Therefore, one has to estimate a floating-point peak position from integer positions such as points A, B, and C in Fig. 2. In this paper, a simple approximation using a 2nd-order polynomial (a parabolic function) is adopted; the detailed algorithm can be found in the appendix. Let α_i represent the estimated floating-point position of the ith peak. Before we use Table 1 to calculate a likely GCD for the pair (α_i, α_j), we need to keep 2α_i < α_j, because the table was designed for minimal storage and only contains entries whose denominator is at least twice the corresponding numerator.
For cases where 2α_i > α_j, we simply replace α_i with α_j − α_i, since the GCD of (α_i, α_j) is mathematically equivalent to the GCD of (α_j − α_i, α_j).
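Steps (a) and (b) can be sketched as follows. The mean-filter smoothing, the protrude threshold T_P = 4, and the Table 1 entries follow the text; the function names, the valley-walking details, and the omission of the parabolic refinement are simplifications of ours.

```python
import numpy as np
from math import gcd

# Harmonic-relation LUT reproducing Table 1: quotient -> (numerator, denominator).
LUT = {n / d: (n, d)
       for d in range(2, 13) for n in range(1, d // 2 + 1)
       if gcd(n, d) == 1}

def find_partials(mag, protrude_threshold=4.0, smooth=5):
    """Step (a): smooth the spectrum with a mean filter, then keep local
    maxima whose protrude value P = V_A / max(V_B, V_C) (Eq. (1))
    exceeds the threshold T_P."""
    s = np.convolve(mag, np.ones(smooth) / smooth, mode='same')
    peaks = []
    i = 1
    while i < len(s) - 1:
        if s[i] > s[i - 1] and s[i] >= s[i + 1]:       # local maximum A
            l = i                                       # walk down to valley B
            while l > 0 and s[l - 1] < s[l]:
                l -= 1
            r = i                                       # walk down to valley C
            while r < len(s) - 1 and s[r + 1] < s[r]:
                r += 1
            if s[i] / max(s[l], s[r], 1e-12) > protrude_threshold:
                peaks.append(i)
            i = r
        i += 1
    return peaks

def likely_gcds(alphas):
    """Step (b): for each partial pair (a_i, a_j) with a_i < a_j, fold the
    quotient below 0.5 if necessary, match it against the closest LUT
    entry, and return the likely GCD  a_j / denominator."""
    quotients = sorted(LUT)
    out = []
    for i in range(len(alphas)):
        for j in range(i + 1, len(alphas)):
            ai, aj = alphas[i], alphas[j]
            if 2 * ai > aj:             # replace a_i by a_j - a_i (section 2.2)
                ai = aj - ai
            q = ai / aj
            best = min(quotients, key=lambda t: abs(t - q))
            out.append(aj / LUT[best][1])
    return out
```

For example, partials at bins 200 and 300 (harmonics 2 and 3 of bin 100) fold to the quotient 1/3, whose LUT entry (1, 3) yields the likely GCD 300/3 = 100.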

Now we can determine a possible harmonic relation for each pair (α_i, α_j) from the LUT. A likely GCD γ_ij can then be calculated directly by dividing α_j by the denominator in the LUT. For example, if the quotient of (α_i, α_j) is close to 0.4, its harmonic relation will be (2, 5) and the likely GCD of this pair will be α_j/5.

2.3 Energy Weighting and Voting

After calculating the likely GCDs of all partial pairs as in section 2.2, one must choose among them to determine F0. Since the critical partials always carry more energy than most other frequency components, we design a weight factor for each partial pair according to its magnitudes; the advantage is a further reduction of the effects of inharmonicity and noise. Let β_i be the magnitude corresponding to α_i. The weight factor w_ij for γ_ij is defined by

w_ij = min(β_i, β_j). (2)

To start the voting procedure, all likely GCDs are roughly assigned to musical note partitions determined by a quantization factor Q,

c_ij = floor(γ_ij/Q + 0.5). (3)

Moreover, an indicator function is defined by

θ_ij(k) = 1 if c_ij = k or c_ij = k + 1; 0 otherwise. (4)

Next, the weighted sum of each partition is evaluated as

S(k) = Σ_{i,j} w_ij θ_ij(k). (5)

The most probable pitch position falls into the partition with the greatest weighted sum. The centroid method [16] is then used to calculate a more accurate pitch position r involving all the likely GCDs in that partition, i.e.,

r = Σ_{i,j} γ_ij w_ij θ_ij(k) / S(k). (6)

With window size W and sampling frequency F_S, the estimated fundamental frequency f_p is obtained as

f_p = (r/W) F_S. (7)
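Step (c), Eqs. (2)-(7), amounts to the following sketch; the vectorized layout and the default parameter values (the paper does not state its choice of Q) are illustrative assumptions of ours.

```python
import numpy as np

def wgcdv_vote(gcds, weights, Q=2.0, frame_size=2048, fs=44100.0):
    """Quantize the likely GCDs into partitions (Eq. (3)), accumulate the
    energy weights of each partition and its right neighbour
    (Eqs. (4)-(5)), refine the winning partition with the weighted
    centroid (Eq. (6)), and convert bins to Hz (Eq. (7))."""
    gcds = np.asarray(gcds, dtype=float)
    weights = np.asarray(weights, dtype=float)
    c = np.floor(gcds / Q + 0.5).astype(int)
    scores = np.zeros(c.max() + 1)
    for k in range(len(scores)):
        mask = (c == k) | (c == k + 1)          # indicator of Eq. (4)
        scores[k] = weights[mask].sum()         # weighted sum of Eq. (5)
    k_best = int(np.argmax(scores))
    mask = (c == k_best) | (c == k_best + 1)
    r = (gcds[mask] * weights[mask]).sum() / scores[k_best]   # Eq. (6)
    return r * fs / frame_size                  # Eq. (7)
```

With likely GCDs clustered around bin 20.5 plus one octave-error outlier at bin 41, the heavily weighted cluster wins the vote and the centroid lands on bin 20.5.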

3. FRAME-BASED CORRECTION METHOD

On some occasions, very weak and unstable tones are produced because of light or uneven bowing or blowing pressure. In such cases the fundamental may disappear, or the tones may be too weak to be detected by many pitch detection algorithms, including the proposed WGCDV method. No matter how accurate a single-frame F0 estimation method is, its accuracy can be improved by using context information from consecutive frames. The basic assumption of the pitch correction procedure is that the pitch within a note of any musical performance does not change abruptly. Thus, the first step is to segment the source into note regions. In general, the spectrum changes markedly in both timbre and energy in the transition region between two notes. The measure defined in Eq. (8) quantifies the degree of change between two successive frames:

d = Σ_f |A_i(f) − A_{i−1}(f)| / Σ_f A_i(f), (8)

where f is the frequency index and A_i(·) is the spectral magnitude function of the ith frame. It is worth noting that d equals zero only when the spectra of two adjacent frames are identical; the degree of change increases whether the energy varies steeply or the timbre is reshaped. When d is greater than 0.7, a note change is assumed. Another constraint is that the duration of one note cannot be shorter than the human physical reaction time. Because of the skill limitations of a human performer, two changing points should not occur within a very short time, say less than one semiquaver or one eighth of a second. In such a situation, one of the changing points can be eliminated to obtain a clean cut between two notes. After the note regions are segmented, a reference pitch for each region is decided by taking the median of the estimated pitches of all frames in the region. As shown in Fig. 3, the note region between changing points g and h is shorter than 0.125 second (about 5 hop sizes if the hop size is 1024 samples at a 44.1 kHz sampling rate). The changing point h should be removed, because the estimated pitch at point h differs from the estimated pitches of its adjacent frames. As mentioned above, we suppose that there should be no abrupt, large pitch change within a note region. Small pitch changes are, however, allowed, because vibrato and portamento are common playing techniques on the target instruments. Fortunately, the pitch changes caused by vibrato and portamento within a short period are usually less than one octave. Thus, if the pitch variation between adjacent frames is larger than an octave, or the estimated pitch of one frame is an octave offset from the reference pitch of its note region, FBC assumes there is an error to be corrected. The new pitches of misjudged frames are interpolated from those of neighboring frames, as in the example shown in Fig. 4. Although the proposed FBC was designed according to the specific characteristics of some musical instruments, it can be adapted to other situations, such as human voices, by taking vocal features into consideration. Note also that the FBC method is developed independently of the WGCDV method and can be applied to other pitch detection schemes as well.
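A simplified sketch of FBC under the thresholds stated in the text (d > 0.7 for a note change, a minimum note length, an octave-offset test). The data layout is ours, and misjudged frames are snapped to the region's median reference pitch rather than interpolated from neighbors, which is a simplification of the procedure described above.

```python
import numpy as np

def spectral_change(prev_mag, cur_mag):
    """Degree of change d between two successive frames (Eq. (8)): zero
    only when the spectra are identical, growing when either the energy
    or the timbre changes."""
    return np.abs(cur_mag - prev_mag).sum() / cur_mag.sum()

def frame_based_correction(pitches, mags, d_thresh=0.7, min_len=5):
    """Segment the frames into note regions at points where d > d_thresh
    (two cuts closer than min_len frames keep only the later one), then
    snap frames lying an octave or more from the region's median pitch
    back onto that reference pitch."""
    pitches = np.array(pitches, dtype=float)
    bounds = [0]
    for i in range(1, len(pitches)):
        if spectral_change(mags[i - 1], mags[i]) > d_thresh:
            if len(bounds) > 1 and i - bounds[-1] < min_len:
                bounds[-1] = i          # changing points too close: merge
            else:
                bounds.append(i)
    bounds.append(len(pitches))
    for a, b in zip(bounds[:-1], bounds[1:]):
        ref = np.median(pitches[a:b])
        for i in range(a, b):
            ratio = max(pitches[i], ref) / max(min(pitches[i], ref), 1e-9)
            if ratio >= 2.0:            # octave-level error
                pitches[i] = ref
    return pitches
```

In a toy run with two spectrally distinct notes and one octave error inside the first note, the erroneous frame is pulled back to the note's reference pitch while vibrato-scale deviations would be left untouched.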

Fig. 3. Example of an ambiguous note-change detection for an Erhu.

Fig. 4. Pitch adjustment before and after frame-based correction (FBC) in a note region.

4. EXPERIMENTAL RESULTS AND DISCUSSION

Recordings of solo performances on Erhu, trumpet, and violin are adopted to test the proposed strategy. A synthetic song produced by a wavetable-synthesized oboe is also provided as a contrast set. The mono sound materials are sampled at 44.1 kHz with 16-bit resolution and are available at [17]. The experimental results of WGCDV, HPS, Praat, and YIN are listed in Table 2; each method combined with FBC is also tested. The frame size is 2,048 with 50% overlap between adjacent frames, and the STFT window is a Hamming window. An estimation error rate is used to evaluate performance and is calculated as

e = (F_error / F_total) × 100%, (9)

where F_total is the total number of non-silent audio frames and F_error is the number of frames with wrong estimates. The actual pitch of each frame is identified manually by a musician who is an Erhu player. When the estimated pitch falls within half a semitone of the actual pitch (about a 2.973% margin), it is counted as a correct estimate. Table 2 shows the performance of the methods. While most methods are quite good for signals that are easy to analyze, such as the synthetic sound, there are

Table 2. Estimation errors with different programs. (2.973% margin)

(a) Spectrum. (b) Actual (solid line) and estimated pitch contours.
Fig. 5. Missing fundamental case.

some occasions that can confuse the detectors. To illustrate the reliability of these methods in more detail, we discuss three special cases as follows.

The first case is the missing fundamental problem. The second tone shown in Fig. 5 is a typical missing-fundamental sound in which the fundamental's energy is far below that of the other partials. The spectrogram clearly shows that the energy of the fundamental component (~300 Hz) stays below the noise floor. The actual (solid line) and estimated pitch contours are shown in the bottom subplot. Most of the detectors discussed in this paper work well, except for some understandable errors due to the strong energies of the second and fourth partials. After FBC is applied, most errors are corrected. Note that Praat failed the test in the latter half of the tone. In addition, perception-based detectors should also perform well in this regard [18].

The second case is the under-estimation case. The top subplot in Fig. 6 shows its spectrogram, in which strong energy also appears in the regions around 0.5 F0 and 1.5 F0. This often happens when the Erhu is played with low bowing speed and small bowing pressure. For most frequency-domain methods, detection errors occur easily because of the seemingly harmonic structure. Compared to HPS, the proposed WGCDV method avoids some misjudgments thanks to the weighting and voting strategy.

(a) Spectrum. (b) Actual (solid line) and estimated pitch contours.
Fig. 6. Under-estimated case.

The third case is the reverberation case. We use the reverb function of Adobe Audition 2.0 to add different degrees of reverberation (delay time = 50, 100, and 150 ms, respectively). Fig. 7 shows the spectrograms of a synthetic signal and a processed signal; an oboe song synthesized with the wavetable method is used as the example. Table 3 shows the results of all the methods. The proposed method performs better in the highly reverberant case, but YIN and Praat outperform it in the other two cases. Note that a clear harmonic structure remains because of the lingering sound of the preceding tone; this phenomenon confuses all pitch detectors and delays the correct estimation of the new pitch. The WGCDV method again benefits from the weighting and voting strategy and has the best average performance.

The last experiment tests the accuracy of all methods. Synthetic signals of different pitches (440 Hz, 450 Hz, 460 Hz, and 470 Hz) are produced. Table 4 shows the average results over 80 frames. Praat is the best performer; WGCDV and YIN perform less well at 460 Hz, but the error is still much less than a semitone.
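The correctness criterion and the error rate of Eq. (9) amount to the following; the 2.973% figure is presumably half of the ~5.946% one-semitone frequency ratio, and the function names are ours.

```python
def is_correct(estimated_hz, actual_hz, margin=0.02973):
    """A frame counts as correct when the estimate lies within half a
    semitone of the actual pitch (about a 2.973% margin)."""
    return abs(estimated_hz - actual_hz) <= margin * actual_hz

def error_rate(estimates, actuals):
    """Eq. (9): percentage of non-silent frames with a wrong estimate."""
    errors = sum(not is_correct(e, a) for e, a in zip(estimates, actuals))
    return 100.0 * errors / len(actuals)
```

For instance, an estimate of 441 Hz against a 440 Hz reference passes the margin test, while an octave error at 880 Hz does not.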
(a) Original synthetic signal. (b) Reverberant synthetic signal (delay time = 150 ms).
Fig. 7. Spectrograms.

Table 3. Estimation errors for reverberant signals with different methods. (2.973% margin)

delay time (ms)   HPS     HPS + FBC   WGCDV   WGCDV + FBC   PRAAT   YIN
50                19.64   15.47       11.90   10.11         8.92    11.90
100               31.54   27.97       30.95   23.21         26.78   19.64
150               41.07   35.11       32.14   23.80         32.73   29.16

Table 4. Tests of accuracy of different methods (in Hz).

target   WGCDV      HPS        PRAAT      YIN
440      439.8185   439.2775   440.0005   440.001
450      450.3271   452.1973   448.5632   450.003
460      463.0937   463.5023   460.0002   463.3192
470      470.2302   469.4239   469.9911   469.9989

Similar experiments and analyses were performed on various bowed-string and wind instruments. In our experiments, bowed-string instruments are more difficult than wind instruments. The reasons why these methods produce unsatisfactory results are quite similar: the test samples are extracted from commercially available compact discs and usually contain a certain degree of reverberation. The proposed WGCDV + FBC method performs well on the provided samples; however, all methods perform poorly when the signals are overly reverberant. One overly reverberant example can be heard at [17]. More investigation is required in this respect.

5. CONCLUSION

A pitch detection method called weighted greatest common divisor and vote (WGCDV) for recordings of solo bowed-string and wind instruments has been presented. The proposed method was tested on a wide range of audio recordings extracted from commercially available compact discs. The GCD look-up-table idea lets the GCD approach sidestep its mathematical restriction to integers and provides a more intuitive estimate than the traditional formulation. Based on the performing characteristics of the target instruments, a frame-based correction (FBC) method is also proposed to track the pitch contour and improve existing methods. The proposed strategy compares favorably with several pitch tools and achieves better performance on most test recordings. As mentioned in [14], tracking rapid pitch variation accurately may be more important than finding the exact frequency in hertz of a tone; most listeners do not perceive a pitch problem in the re-synthesis results as long as there is no large pitch-tracking error. The re-synthesis software is also available at [17] for reference. The lightweight computation makes the proposed strategy a practical basis for designing real-time analysis and synthesis applications for solo bowed-string and wind instruments.

APPENDIX

In this appendix, the results required for the parabolic approximation are derived. First, we try to find a peak from three adjacent points (x_1, y_1), (x_2, y_2), and (x_3, y_3) with the following relationships:

x_1 = x_2 − 1, x_3 = x_2 + 1, y_2 > y_1, y_2 > y_3. (10)

The first two relationships indicate that the parabolic function can be centered on the second point, so the coordinates can be rewritten as (−1, y_1), (0, y_2), and (1, y_3). Now we start from the generic parabolic function of Eq. (11) to interpolate the peak point (x*, y*), as illustrated in Fig. 8.

Fig. 8. Three-point parabolic approximation.
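The three-point setup above leads to a closed-form peak estimate that fits in a few lines; a sketch (the function name is ours):

```python
def parabolic_peak(y1, y2, y3):
    """Refine an integer peak position from three adjacent magnitude
    samples (y2 the local maximum, y1 and y3 its neighbours) by fitting
    y = a*x^2 + b*x + c centred on the middle sample.  Returns the
    fractional offset x* from the middle sample and the peak value y*."""
    a = (y1 + y3 - 2.0 * y2) / 2.0
    b = (y3 - y1) / 2.0
    c = y2
    x_star = -b / (2.0 * a)          # where the derivative 2ax + b vanishes
    y_star = c - b * b / (4.0 * a)
    return x_star, y_star
```

Sampling the parabola y = −(x − 0.3)² + 5 at x = −1, 0, 1 recovers the true peak (0.3, 5).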

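Before working through the algebra, the closed-form peak position and value derived in this appendix can be implemented directly. A minimal sketch (function and variable names are mine, not the paper's):

```python
def parabolic_peak(y1, y2, y3):
    """Interpolate the peak of the parabola through (-1, y1), (0, y2), (1, y3).

    Returns (x_peak, y_peak), where x_peak is an offset relative to the
    middle point, matching the closed-form solution of the appendix.
    """
    denom = y1 + y3 - 2.0 * y2          # equals 2a; negative at a true maximum
    if denom == 0.0:                    # collinear points: no curvature
        return 0.0, y2
    x_peak = (y1 - y3) / (2.0 * denom)
    y_peak = y2 - (y3 - y1) ** 2 / (8.0 * denom)
    return x_peak, y_peak
```

For example, sampling f(x) = -(x - 0.25)^2 at x = -1, 0, 1 gives (y1, y2, y3) = (-1.5625, -0.0625, -0.5625), and the sketch recovers the true peak at x = 0.25, y = 0. In a pitch tracker, this is typically applied to three adjacent magnitude-spectrum bins around a local maximum to refine a partial's frequency estimate beyond the bin resolution.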
    y = ax^2 + bx + c.                                        (11)

Substituting the three given points into Eq. (11) yields

    y1 = a - b + c,
    y2 = c,                                                   (12)
    y3 = a + b + c.

Solving Eq. (12) for a, b, and c gives

    a = (y1 + y3 - 2y2) / 2,
    b = (y3 - y1) / 2,                                        (13)
    c = y2.

The peak occurs where the first-order derivative vanishes,

    dy/dx = 2ax + b = 0.                                      (14)

The peak position is therefore -b/(2a) and the peak value is c - b^2/(4a). In terms of the sample values, the solutions can be written as

    x* = -b/(2a) = (y1 - y3) / (2(y1 + y3 - 2y2)),            (15)

    y* = c - b^2/(4a) = y2 - (y3 - y1)^2 / (8(y1 + y3 - 2y2)). (16)

Note that the approximated peak position x* is an offset relative to the second given point.

REFERENCES

1. B. Kedem, "Spectral analysis and discrimination by zero-crossings," Proceedings of the IEEE, Vol. 74, 1986, pp. 1477-1493.
2. L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice-Hall, New Jersey, 1978, pp. 116-130.
3. C. Roads, "Autocorrelation pitch detection," The Computer Music Tutorial, MIT Press, 1996, pp. 509-511.
4. O. Deshmukh, C. Y. Espy-Wilson, A. Salomon, and J. Singh, "Use of temporal information: Detection of periodicity, aperiodicity, and pitch in speech," IEEE Transactions on Speech and Audio Processing, Vol. 13, 2005, pp. 776-786.
5. A. M. Noll, "Pitch determination of human speech by the harmonic product spectrum, the harmonic sum spectrum, and maximum likelihood estimate," in Proceedings of the Symposium on Computer Processing in Communications, 1969, pp. 779-797.
6. H. Quast, O. Schreiner, and M. R. Schroeder, "Robust pitch tracking in the car environment," in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, 2002, pp. I-353-I-356.
7. W. J. Hess, Pitch Determination of Speech Signals, Springer-Verlag, New York, 1983.
8. W. J. Hess, "Pitch and voicing determination," in Advances in Speech Signal Processing, 1992, pp. 3-48.
9. D. J. Hermes, "Pitch analysis," in Visual Representations of Speech Signals, John Wiley & Sons, England, 1993, pp. 3-25.
10. B. C. J. Moore, An Introduction to the Psychology of Hearing, 4th ed., Academic Press, San Diego, 1997.
11. P. Boersma and D. Weenink, "Praat: Doing phonetics by computer" (Version 4.5.13) [Computer program], retrieved 2007, http://www.praat.org/.
12. H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Communication, Vol. 27, 1999, pp. 187-207.
13. A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," Journal of the Acoustical Society of America, Vol. 111, 2002, pp. 1917-1930.
14. Y. S. Siao, W. L. Chang, and A. Su, "Analysis and trans-synthesis of solo Erhu recordings using adaptive additive/subtractive synthesis," in 120th Convention of the Audio Engineering Society, Paris, 2006.
15. H. Järveläinen, V. Välimäki, and M. Karjalainen, "Audibility of the timbral effects of inharmonicity in stringed instrument tones," Acoustics Research Letters Online, Vol. 2, 2001, pp. 79-84.
16. R. Honsberger, Episodes in Nineteenth and Twentieth Century Euclidean Geometry, Mathematical Association of America, Washington, 1995.
17. Erhu Analysis/Synthesis Tool, http://scream.csie.ncku.edu.tw/~al/erhusynthwww/.
18. A. de Cheveigné, "Pitch perception models," in Pitch, Springer, New York, Vol. 24, 2005, pp. 169-233.

Yi-Song Siao received his B.S. and M.S. degrees in Computer Science and Information Engineering from National Cheng Kung University, Tainan, Taiwan, in 2003 and 2005, respectively. He began learning the Erhu at the age of thirteen and carried this interest into his studies. In 2004, he proposed the JavaOL concept (120th Convention of the AES, May 2006), which improves the performance and flexibility of MPEG-4 Structured Audio. In 2005, he applied the additive synthesis method to synthesizing the Erhu sound and built an interactive analysis/synthesis tool. His research interests include computer music, audio signal processing, GUI design, and computer graphics.

Wei-Chen Chang was born in Taipei, Taiwan, R.O.C., in 1975. He received the B.S. degree in Mathematics and the M.S. and Ph.D. degrees in Computer Science and Information Engineering from National Cheng Kung University, Taiwan, in 1997, 2002, and 2008, respectively. From 2007 to 2008, he was a visiting scholar at IRCAM, Paris, where he worked on polyphonic estimation and tracking. His research activities include data compression, signal processing, model-based music synthesis, and machine learning.

Alvin W. Y. Su received his B.S. degree in Control Engineering from National Chiao Tung University, Hsinchu, Taiwan, R.O.C., in 1986. He received his M.S. and Ph.D. degrees in Electrical Engineering from Polytechnic University, Brooklyn, New York, in 1990 and 1993, respectively. From 1993 to 1994, he was with CCRMA, Stanford University, Stanford, California. From 1994 to 1995, he was with CCL (Computer and Communication Lab.), ITRI, Taiwan. In 1995, he joined the Department of Information Engineering, Chun Hwa University, Hsinchu, Taiwan. In 2000, he joined the Department of Computer Science and Information Engineering of National Cheng Kung University (NCKU), where he served as an Associate Professor. He is the director of the Campus Information System Group of NCKU and the director of SCREAM (Studio of Computer REseArch on Music and Multimedia), NCKU. His research interests cover digital audio/video signal processing, physical modeling of acoustic instruments, multimedia data compression, P2P multimedia streaming systems, embedded systems, VLSI signal processor design, and ESL (Electronic System Level) tool design.