IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 21, NO. 9, SEPTEMBER 2013

Musical Instrument Recognition in Polyphonic Audio Using Missing Feature Approach

Dimitrios Giannoulis, Student Member, IEEE, and Anssi Klapuri, Member, IEEE

Abstract—A method is described for musical instrument recognition in polyphonic audio signals where several sound sources are active at the same time. The proposed method is based on local spectral features and missing-feature techniques. A novel mask estimation algorithm is described that identifies spectral regions containing reliable information for each sound source, and bounded marginalization is then used to treat the feature vector elements that are determined to be unreliable. The mask estimation technique is based on the assumption that the spectral envelopes of musical sounds tend to be slowly varying as a function of log-frequency; unreliable spectral components can therefore be detected as positive deviations from an estimated smooth spectral envelope. A computationally efficient algorithm is proposed for marginalizing the mask in the classification process. In simulations, the proposed method clearly outperforms reference methods for mixture signals. The proposed mask estimation technique leads to a recognition accuracy that is approximately half-way between a trivial all-one mask (all features assumed reliable) and an ideal oracle mask.

Index Terms—Audio signal processing, bounded marginalization, harmonic sound, missing data techniques, musical instrument recognition.

I. INTRODUCTION

HUMANS have a remarkable ability to recognize objects based on partial and incomplete information. We need to see only the tail of a dog or the corner of a smartphone to recognize them. Similarly in the auditory domain, a significant portion of the time-frequency information of a sound can be replaced by noise and we can still usually recognize the sound with ease, especially when it is presented in context [1], [2]. Moreover, the interference can be highly time-varying, as is the case in everyday environments such as a busy city street where multiple sound sources compete to dominate different regions of the time-frequency plane. The auditory system appears to effortlessly ignore noisy spectrotemporal regions, provided that a sufficient number of other regions consistently support the presence of a given sound source or event. Besides recognizing sounds, the auditory system also partly restores the corrupted spectrotemporal data based on previously stored information [1].

Music signals represent a large class of audio data where several sound sources are usually present at the same time. Depending on the genre, the instrumentation may consist of electric guitars, bass, drums, and vocals; or saxophone, piano, strings, and percussion, for example.
There is a wide variety of instruments in Western music alone, representing different sound production mechanisms and timbres [3]. Automatic recognition of the instruments in recorded music has several direct applications, including music retrieval based on instrumentation and audio management in recording studios. Even more importantly, sound source recognition and modeling is an essential part of making sense of complex audio signals. When listening to polyphonic music, human listeners are able to perceptually organize the component sounds into their sources, largely based on timbre information. Similarly, source models are an integral part of music transcription and sound separation systems, where the source identity enables the use of source-specific models and assumptions and allows the organization of sound events into streams that can be attributed to certain instruments [4], [5].

What makes instrument recognition particularly hard in polyphonic music is that the source to be recognized often corresponds to only a small fraction of the total energy of the mixture signal. The interference caused by the other instruments is highly non-stationary and unpredictable, since the identities of the other instruments are usually not known either.

A number of different approaches have been proposed for polyphonic musical instrument recognition (see [6]–[9] for reviews). The most straightforward one is to extract acoustic features directly from the mixture signal. For example, Little and Pardo [10] trained binary classifiers using weakly-labelled polyphonic audio in order to detect the presence of individual instruments in music. Kitahara et al. [8] used linear discriminant analysis to identify acoustic features that are robust to the interference of other sources. Essid et al. [11] developed a system for recognizing instrument combinations directly. Secondly, one can attempt to separate the signals of different instruments from the mixture and then classify each signal separately. This approach has been widely used, and there are a number of different mechanisms to perform the source separation [12]–[16]. Obviously, sound separation is a hard problem in itself and is often the bottleneck of these systems, although perfect separation is not needed for classification purposes, especially if the class models are trained using similar data with separation artefacts [16]. Thirdly, one can perform source separation and recognition jointly, for example based on statistical inference within parametric signal models [17]–[19], or by employing sparse representations where a mixture signal is represented as a weighted sum of atoms with pitch and instrument labels [20].

Eggink and Brown [21] introduced the missing feature approach to musical instrument recognition. In these techniques, one attempts to identify spectrotemporal regions that represent reliable observations of a sound source, in contrast to regions that are corrupted by noise or interfering sources and are therefore labeled as unreliable or missing [2]. The reliability information is often represented in the form of a time-frequency mask which can be either binary or real-valued, representing a measure of confidence for the spectral components. Many natural sounds, including musical sounds, are sparse in the time-frequency domain, and as a consequence it is reasonable to expect that portions of their spectrogram data remain uncorrupted even in the presence of other sources. Missing feature approaches provide a general framework for recognizing sound sources based on partial information [2], [22], [23]. There are two broad classes of techniques for handling spectrotemporal elements that have been identified as unreliable or missing: feature vector imputation and marginalization. In feature imputation, one attempts to restore the unreliable elements of a feature vector based on the reliable ones, before using it for classification. In marginalization, the classifier itself is modified: the unreliable features can be completely marginalized (excluded from the classification), or the observed power-spectral feature can be considered as an upper bound for the missing clean feature value and bounded marginalization can be applied [24]. Obviously, estimating the reliability (or mask) information automatically from a mixture signal is difficult and constitutes a central part of these approaches. Various approaches to mask estimation will be discussed in Section II-E. For a comprehensive overview of missing feature algorithms, see [22], [23].

Despite the fact that missing feature techniques are an excellent match for sound source recognition in the presence of time-varying interference, there has been very little research in this direction in instrument recognition since the work of Eggink and Brown [21]. One of the potential reasons is the difficulty of mask estimation in music, where both the target and the interfering sources represent the broad class of musical instruments, which is less strictly defined than speech signals or stationary noise sources. A recent technique, loosely related to missing feature techniques, is the system of Barbedo and Tzanetakis [25], which employs pitch and onset detection to identify clean (reliable) partials and then performs instrument recognition based on individual harmonic partials.

In this paper, we propose a missing-feature algorithm for musical instrument recognition, handling unreliable features with bounded marginalization. The main aim is to perform instrument recognition in polyphonic music without the preceding step of sound separation and without requiring reliable multipitch detection. This is because the mentioned steps are error-prone and can therefore become a bottleneck for the recognition. Instead, we propose a novel mask estimation method that calculates the probabilistic reliability of different feature vector elements, and then propose an algorithm for marginalizing the mask.
Models for each instrument class are trained from clean data (isolated signals), since the combination of instruments at the test stage is not known in advance and generating training data with matched interference is therefore not possible. The models are constructed for spectral-domain feature vectors because, for these, the interference of other sources can be considered to have only a local effect, contrary to cepstral vectors where interference is spread all across the feature vector. We propose new features based on log-energy differences between subbands, removing the need for level normalization, which would otherwise be problematic when part of the spectrum is labeled as missing.

The proposed method performs instrument recognition independently in each analysis frame. There are obvious ways of extending the method to integrate information over time and to include temporal features in the proposed framework. However, temporal processing is not specific to the proposed techniques and is therefore beyond the scope of this paper.

Simulation results are reported for mixtures of recorded musical sounds. In the polyphonic scenario, the proposed method clearly outperforms a reference method that uses Mel-frequency cepstral coefficients (MFCCs) as features and a Bayesian classifier with Gaussian mixture models (GMMs) to represent the class-conditional likelihood distributions.

II. METHOD

In the following, we consider instrument recognition within an individual time frame. Let us denote the observed time-domain signal in frame t by the vector x_t. The observation is modelled as a mixture of harmonic sounds and a residual:

  x_t = Σ_{p∈P_t} y_{p,t} + r_t,   (1)

where p denotes the pitch of sound y_{p,t} and the set P_t contains the pitches of all sounds that are active in frame t. The residual signal r_t represents all non-harmonic sounds such as drums or background noise. For convenience, we omit the frame index t in the following, since the processing is identical in all analysis frames, and write (1) as x = Σ_{p∈P} y_p + r.

The problem addressed in this paper is to calculate P(c | x): the probability that a sound from instrument class c is present in the observed analysis frame. That can be written as

  P(c | x) = Σ_P P(c | x, P) P(P | x),   (2)

where the candidate pitch sets P and their probabilities P(P | x) are obtained using the multipitch detector described in [26]. For simplicity, here we set the probability of the detected set of pitches to one, which makes the sum vanish in (2). In other words, the multipitch detector determines a single set of active pitches. However, it is important to note that for classification purposes we do not need to assume a perfectly reliable multipitch estimator. It has only a gatekeeper role in admitting sound candidates to the classification stage: If the set includes wrong pitch values, we run a risk that the corresponding spurious sounds may by chance sound like a real instrument, but the risk is low if the classification stage works well. If a truly existing pitch is not detected, we miss the opportunity to use the corresponding sound to detect the presence of an instrument. However, there are usually a number of other notes from the same instrument (although not necessarily in the same analysis frame), so this is not critical.

The classification stage consists of calculating P(c | x, P) in (2): the probability that class c is present given x and P. In polyphonic music, one sound from class c suffices to conclude that class c is present. In other words, class c is not present if none of the sounds belongs to class c. Assuming that the class membership probabilities of individual sounds are independent of each other, the latter can be written as

  P(c | x, P) = 1 − Π_{p∈P} [1 − P(c | x, p)],   (3)

where P(c | x, p) denotes the probability that the sound with pitch p belongs to class c. We make two further simplifying assumptions and write

  P(c | x, P) ≈ 1 − Π_{p∈P} [1 − P(c | x_p, p)].   (4)

On the right-hand side, we have assumed that the class probability of a sound depends on its own pitch, but not on the pitches of the other sounds. The vector x_p is a feature vector extracted from the mixture signal to represent the sound with pitch p. Replacing x with x_p on the right-hand side states the assumption that all the information regarding the class of the sound is distilled into x_p, and P(c | x, p) therefore reduces to P(c | x_p, p).

The focus of this paper is on calculating P(c | x_p, p), that is, the probability that a candidate sound belongs to class c given its pitch and the feature vector. What makes this problem interesting is that the feature vector is extracted from the mixture signal and is therefore usually partly obscured by other co-occurring sounds that overlap in the time-frequency domain. Probabilistic models representing instrument c are typically trained using clean feature vectors extracted from isolated signals representing that instrument. This is because the interference caused by other, co-occurring sounds in polyphonic music is highly varying and unpredictable, and any interference introduced at the training stage would therefore hardly be representative of the interference present at the test stage. The problem can then be re-stated as calculating the probability P(c | x_p, p) when the statistical models representing class c have been trained from clean data and we do not know which elements in the feature vector are reliable (clean) and which are obscured. The approach taken in the following is to estimate a mask vector that is of the same size as x_p and contains the probabilities that the different elements of x_p are reliable. The mask is then marginalized in the classification process. However, let us first look at the feature representation.

A. Feature Representation

A variety of acoustic features have been used for instrument recognition, including spectral, cepstral, temporal, and modulation-spectral features. For a comprehensive review and comparative evaluations, see [7], [11], [27], [28]. In the missing feature framework, the features should be local in time-frequency, since it is in this domain that the interfering sounds tend to have a sparse energy distribution and therefore only a local effect on the features. In the following, we propose spectral features suitable for musical instrument recognition in polyphonic signals.

The feature vector is calculated in two phases. First, the harmonic partials for a sound with pitch p are picked from the mixture spectrum by assuming perfect harmonicity, that is, assuming that the frequencies of the overtone partials are integer multiples of the pitch.
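As a rough illustration of this first phase, the following sketch picks partial magnitudes from a magnitude spectrum at integer multiples of a candidate pitch. It is a minimal sketch, not the authors' implementation: the half-semitone search window, the default number of partials, and the function name are illustrative assumptions (Section III-B only states that local maxima are searched in the vicinity of the ideal partial positions).

```python
import numpy as np

def harmonic_partial_powers(spectrum, freqs, f0, n_partials=30, window_semitones=0.5):
    """Squared magnitudes of the harmonic partials of a candidate pitch f0.

    spectrum : magnitude spectrum of one analysis frame
    freqs    : centre frequency (Hz) of each spectrum bin
    Returns an array u where u[h-1] is the power of partial h (h = 1..n_partials).
    """
    u = np.zeros(n_partials)
    for h in range(1, n_partials + 1):
        target = h * f0
        lo = target * 2.0 ** (-window_semitones / 12.0)
        hi = target * 2.0 ** (window_semitones / 12.0)
        idx = np.where((freqs >= lo) & (freqs <= hi))[0]
        if idx.size > 0:
            # pick the local maximum near the ideal partial position
            u[h - 1] = np.max(spectrum[idx]) ** 2
    return u
```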
Extracting spectral energy only at the positions of the harmonic partials usually improves the signal-to-noise ratio considerably from the viewpoint of the candidate pitch, and is one of the main reasons for needing to assume candidate pitch values instead of extracting features directly from the mixture spectrum. Essentially, we assume that meaningful information about the source identity is sampled only at the positions of the harmonic partials.

Let the vector u denote the squared magnitudes of the harmonic partials in the observed mixture spectrum at the frequencies of the integer multiples of the pitch. These are linearly transformed and subjected to logarithmic compression,

  x = log(W u),   (5)

where the logarithm is taken element-wise. The transform matrix W is designed to map from a linear to a log-frequency resolution and thereby reduce the dimensionality and improve the statistical properties of the features. Its elements are given by (6)–(7): an element of W is non-zero if the corresponding partial falls within the corresponding subband, and zero otherwise. A resolution parameter determines the log-frequency resolution of the features, leading to a third-octave resolution, for example. For a sufficiently small value of this parameter, W becomes an identity matrix. In the following, we use the term subband to refer to the elements of the feature vector x. The center frequencies of the subbands depend on the pitch and are defined recursively, as given in (8), starting from the pitch itself. This ensures that all elements of x represent subbands where there is at least one harmonic partial, rather than only gaps between the partials (especially between the lowest few partials, where the gaps are wide on the log-frequency scale). Fig. 1 illustrates the transformation matrix.

Fig. 1. Illustration of the transformation matrix.
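The band layout below is a simplified sketch of this mapping: the paper defines the subband centre frequencies recursively in (8) so that every band contains at least one partial, whereas here a new, roughly third-octave band is simply started whenever the next partial falls outside the current one. The grouping rule and the function name are assumptions made for illustration only.

```python
import numpy as np

def log_band_features(u, f0, bands_per_octave=3):
    """Map harmonic partial powers to log-compressed, roughly third-octave bands.

    u[h-1] is the power of partial h (frequency h*f0). Each band accumulates the
    power of the partials it contains; the log of each band power is returned,
    approximating Eq. (5) with a binary band-assignment matrix.
    """
    band_width = 2.0 ** (1.0 / bands_per_octave)   # frequency ratio spanned by one band
    features = []
    band_start, band_power = f0, 0.0
    for h, power in enumerate(u, start=1):
        freq = h * f0
        if freq >= band_start * band_width and band_power > 0.0:
            features.append(np.log(band_power + 1e-12))
            band_start, band_power = freq, 0.0
        band_power += power
    features.append(np.log(band_power + 1e-12))
    return np.array(features)

# Example: the partial powers returned by harmonic_partial_powers() for a
# candidate pitch would be passed in as u.
```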

B. Binary Mask

Above, the powers of the harmonic partials were extracted from the mixture signal. The features are typically partly obscured by the other sounds that co-occur and whose partials overlap those of the candidate sound. This is particularly likely in music, where harmonic pitch relationships and synchronous timing are favoured, and partial collisions are therefore a rule rather than an exception.

Let us use ũ to denote the unobserved, clean harmonic partial powers of the sound in isolation, and x̃ to denote the corresponding unobserved feature vector that we would obtain if the clean sound were presented separately. Let us define a binary mask with elements m_b, where m_b = 1 indicates that the measured log-power for subband b is dominated by energy actually coming from the source with pitch p. More exactly, we assume that the (unobserved) clean feature vector obeys

  x̃_b = x_b if m_b = 1, and x̃_b ≤ x_b if m_b = 0.   (9)

The latter assumption stems from the fact that the expected value of the power spectrum of the mixture signal is the sum of the power spectra of the sources. This is valid only in the expectation sense, since individual spectral components can either add up or cancel each other out depending on their relative phases. Despite the approximative nature of (9), the stated assumption is valuable for classification purposes, as explained below.

Meaningful binary masks often exist in practice because natural sound sources tend to be sparse in the time-frequency domain, and the underlying ideal masks of the sources therefore often contain a considerable number of non-zero values (bands where m_b = 1). These clean glimpses of the sources form a basis for the recognition. Note, however, that the subbands where m_b = 0 also inform about x̃: the observed feature value sets an upper bound for the unobserved clean feature value of the source at that band, according to the approximative assumption made in (9).

To keep the notation uncluttered, we omit the subscript p in the following and write simply x, x̃, and m, since the pitch of the corresponding sound will be evident from the context (we will condition the probabilities on p, see below). Furthermore, we will use an ordered set of subband indices to denote the clean subbands, that is, the subbands b for which m_b = 1, ordered from lowest to highest; the corresponding feature values are collected in the same order.

The assumptions stated in (9) allow us to write the probability density function (pdf) of the unobserved clean features x̃ of the sound, as given in (10): at the clean subbands the pdf is a Dirac delta function at the observed value, and at the other subbands it is the distribution of x̃_b given the values at the subbands where m_b = 1, truncated to be zero above the observed value x_b and scaled by a normalizing constant (defined later) so that it integrates to unity. This distribution is learned using isolated sounds from all the different classes. The pdf of the clean features as given by (10) is valuable because the statistical models for the different classes are trained from clean data, for the reasons described above. However, in order to use (10) for classification, we need to marginalize the value of x̃ in the classification process. The presentation in the next subsection follows roughly the one in [24], although the employed model and features are different.

C. Marginalization of the Missing Data and the Mask

Given a candidate pitch p and the corresponding feature vector x, let us use P(M | x, p) to denote the probability that a mask M matches the ideal (unobserved) mask of x. The ideal ("oracle") mask has ones at the positions where the observed feature is dominated by the target sound and zeros elsewhere.
Note that the probability P(M | x, p) represents confidence in the estimated mask, whereas the mask vector itself represents confidence in the observed feature vector elements (as described by (9)); these two should not be confused. The probability that a candidate sound belongs to class c, as required in (4), can be written as

  P(c | x, p) = Σ_M P(c, M | x, p),   (11)

where P(c, M | x, p) denotes the joint probability that the sound belongs to class c and M matches its true mask, and the sum runs over all possible binary masks. The joint probability in (11) can be written as in (12), where the pdf of the clean features is given by (10) and an integral is used to marginalize the unobserved clean feature vector x̃. One factor simplifies because the class does not depend on the observed features or the mask once the clean features and the pitch are given. Using Bayes' rule, (12) becomes (13), which involves the prior probability of class c at pitch p and the likelihood of observing the clean features for class c and pitch p. The latter can be estimated from training data representing isolated (clean) signals from class c. The corresponding class-independent pdf is estimated similarly, but using data from all classes.

Footnote 1: More precisely, m_b = 1 here means that x_b − x̃_b < θ, where values of θ between 3 and 6 dB were found suitable (see Section II-E).
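To make the role of bounded marginalization concrete, the sketch below evaluates a class log-likelihood for one given binary mask. It is a minimal sketch under the simplifying assumption of independent per-band Gaussians trained on clean data; the paper's actual model (Section II-D) instead conditions each band on neighbouring level differences, and the function and parameter names here are illustrative only.

```python
import numpy as np
from scipy.stats import norm

def bounded_log_likelihood(x, mask, mu, sigma):
    """Log-likelihood of observed features x under one class, for a binary mask.

    x         : observed log-power features extracted from the mixture
    mask      : 1 for bands judged reliable, 0 for bands judged unreliable
    mu, sigma : per-band Gaussian parameters estimated from clean training data

    Reliable bands contribute the clean-data density at the observed value;
    unreliable bands contribute the probability mass below the observed value,
    which by Eq. (9) is an upper bound on the unobserved clean value.
    """
    log_like = 0.0
    for x_b, m_b, mu_b, sigma_b in zip(x, mask, mu, sigma):
        if m_b:
            log_like += norm.logpdf(x_b, loc=mu_b, scale=sigma_b)
        else:
            log_like += norm.logcdf(x_b, loc=mu_b, scale=sigma_b)
    return log_like
```

With full (unbounded) marginalization, the unreliable bands would instead contribute a factor of one and could be dropped entirely, which is the comparison made later in Section III.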

The above-described approach for computing P(c | x, p) is theoretically satisfying, but two problems must be solved in order for it to be practically useful. Firstly, summing over the masks in (11) and integrating over the clean features in (13) are not computationally feasible in a direct form. Secondly, meaningful values for the mask probabilities P(M | x, p) are required. These problems are addressed in Sections II-D to II-F below.

D. Factorial Form for the Multidimensional Density

Let us first address the computational complexity of the integral in (13). The factor in the brackets in (13) is assumed to take the factorial form given in (14)–(15), in which a shorthand notation for the per-band terms is introduced on the last row for convenience in the following. In the special case where all subbands are clean (i.e., the mask is all-one), (15) reduces to the joint pdf of the clean feature values. For the noisy bands, i.e., the bands b with m_b = 0, we calculate the level difference between band b and its nearest clean subband. More precisely, the nearest clean subband of b is the member of the set of clean subbands that is nearest to b, and it is used as a point of reference for each band b for which m_b = 0; the second-nearest clean subband is defined similarly. We assume that the unobserved clean value at a noisy band depends only on the level difference between these two nearest clean subbands, but not on the other bands. As a result, the part indicated as "noisy bands" in (15) can be written as a product of per-band terms, in which a shorter feature vector containing only the elements at the clean subbands is used. For the remaining elements, this assumes that the clean values at the noisy bands are independent of each other given the mask and the values at the clean subbands. The assumption allows us to move the integral inside the product, writing it as (17). Based on (10), the pdf of the clean value at each noisy band can be written as in (18), where for the clean bands the Dirac delta of (10) yields the first term. The integral in (15) is then over each element separately, which makes the (originally multidimensional) integral tractable.

Another important requirement is that the features should be invariant to the presentation level (scaling) of the sound, which appears as an additive constant in the log-power features. (Note that we cannot normalize the scale, since some of the feature vector elements are obscured and therefore not available.) To achieve this, we only consider level differences between subbands. Let us use a shorthand for the level difference between two subbands. We assume that the level difference of each neighboring pair of clean subbands depends only on the class, the pitch, and the level differences on both sides, but not on the other subbands. This assumption is made for computational tractability. Retaining the dependency on the differences on both sides is important, since these differences share one subband and are therefore strongly correlated. We can then write the part indicated as "clean subbands" in (15) as (16). The normalizing constant is required because the pdf is truncated; its value is given by (19). Substituting (18) and (19) into (17), we get (20). Finally, substituting (15) and (17) into (13), we can write (11) as (21).

Calculation of the terms in (16) and (20) requires estimating the level-difference distributions from training data. In practice, the joint distributions are estimated for all possible triplets of neighboring subbands, separately for all the different classes, and also for the case where the distributions are not conditioned on the class at all, that is, from training material representing all classes.
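The following sketch shows one way the pairwise level-difference densities could be trained and used. It assumes, as stated in the continuation of Section II-D, that each pair of neighbouring level differences is modelled with a full-covariance two-dimensional Gaussian, so that the conditional distribution of one difference given the other is univariate Gaussian (see the Appendix) and the bounded integral reduces to a Gaussian cumulative distribution. Function names and the data layout are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def fit_pair_gaussian(d1_samples, d2_samples):
    """Fit a 2-D Gaussian (mean and full 2x2 covariance) to a pair of
    neighbouring level differences observed in clean training data."""
    data = np.vstack([d1_samples, d2_samples])
    return data.mean(axis=1), np.cov(data)

def conditional_gaussian(mu, cov, d2):
    """Conditional distribution of d1 given d2 under the fitted 2-D Gaussian
    (the standard formulas reproduced in the Appendix)."""
    mean = mu[0] + cov[0, 1] / cov[1, 1] * (d2 - mu[1])
    var = cov[0, 0] - cov[0, 1] ** 2 / cov[1, 1]
    return mean, np.sqrt(var)

def bounded_probability(mu, cov, d2, upper_bound):
    """Probability mass of d1 below an upper bound, given d2; this is the kind
    of Gaussian-CDF term used when a subband is treated as unreliable."""
    mean, std = conditional_gaussian(mu, cov, d2)
    return norm.cdf(upper_bound, loc=mean, scale=std)
```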

We use a multivariate Gaussian distribution with a full (2 x 2) covariance matrix to model these densities. This renders the conditional distribution univariate Gaussian (see the Appendix). The value of the integral in (20) is then obtained from the Gaussian cumulative distribution.

E. Mask Estimation

Mask estimation is a central and arguably the most difficult part of missing feature techniques. Hardly any methods have been proposed for mask estimation in music, although it should be noted that any instrument recognition system that is based on separating sources from the mixture can be viewed as producing spectrographic masks for the sources as a side-product [12]–[16]. The same can be said about methods intended for musical sound separation in general, especially those based on spectrogram factorization [29]–[31].

A number of mask estimation methods have been proposed in the field of environmentally robust speech recognition [22], [32]. One approach is to estimate the power spectrum of additive noise by assuming that it is slowly varying, and then use it to estimate the signal-to-noise ratio at each time-frequency location [33]–[35]. An alternative approach is to model the target speech signal instead, for example by extracting acoustic features describing each time-frequency location and applying a Bayesian classifier to label those as speech or noise [36]. Thirdly, one can take an auditory scene analysis approach, where spectral components are organized into their respective sound sources based on perceptual cues, such as harmonic frequency relationships and synchronous changes that promote the fusion of components into the same source [24], [37]. All these approaches are less straightforward to apply to music signals: the first because the interference caused by other instruments is not slowly varying and does not represent a single source; the second because both the target and the interference represent the same broad class of musical sounds; the third because music employs harmonic pitch relationships and synchronous onset times to fool the auditory system into perceiving simultaneous sounds as a single entity [1].

The mask estimation algorithm proposed in the following is based on the assumption that the spectral envelopes of musical sounds tend to be smooth: slowly varying as a function of log-frequency, in a specific sense [38]. The amplitude of an individual frequency partial can deviate negatively from the smooth envelope, but is very seldom much higher than those of its neighbors. In the latter case, the partial is more easily perceptually segregated and perceived as a separate sound, which is undesirable for musical sounds [1]. Many instruments can be seen to consist of two acoustically coupled parts: a vibrating source (e.g., a string or an air column) and a resonator such as the guitar body. Usually the excitation signal to the vibrating system resembles a transient or an impulse train, resulting in a spectrum where no individual harmonic stands out. The coupled body resonator, in turn, does not usually have sharp resonance modes, but tends to be strongly damped and to radiate acoustic energy efficiently [3, p. 41].

Fig. 2. Spectrum of a clarinet note and an oboe note which are in a 3:4 pitch relationship. The amplitudes of the harmonic partials of both notes, taken from both the clean signals and the mixture, have been highlighted. The partials that overlap have been marked with arrows.

When two sinusoidal partials with magnitudes a1 and a2 and phase difference θ coincide in frequency, the amplitude of the resulting sinusoid is given by

  a = sqrt(a1^2 + a2^2 + 2 a1 a2 cos θ).   (22)

If a1 and a2 are of similar magnitude, the partials may either amplify or cancel each other, depending on θ. However, if one of the amplitudes is significantly larger than the other, as is usually the case, a approaches the maximum of the two. As a consequence, partials that overlap with a more dominant one (from another source) tend to have higher magnitudes than their neighbors and rise above the smooth spectral envelope. Fig. 2 illustrates a situation where two sounds are in a 3:4 pitch relationship and therefore every third partial of the higher-pitched sound (marked with black arrows in the figure) overlaps every fourth partial of the lower sound. The partials of the lower-pitched sound dominate, except for the fourth partial, for which the observed value is larger than the underlying clean value. For the higher-pitched sound, the third partial is only slightly affected because of its much higher amplitude, but for the 6th, 9th, 12th, and 15th partials, the observed amplitudes are clearly higher than the clean values and the partials rise above the rest of the observed series of harmonic amplitudes.

The above observations suggest that individual partials with amplitudes clearly higher than their neighbors are more likely to have been corrupted by a partial from an interfering source, and the mask value at the corresponding position of the feature vector should be set to zero. That is the basic idea of the mask estimation procedure described in the following.

The algorithm first estimates the smooth spectral envelope by calculating a local moving average over the series of observed harmonic amplitudes of the sound (the powers of the partials, as discussed in Section II-A). An octave-wide Hamming window is centered at each harmonic partial, and a weighted average of the magnitudes of the partials within the window is calculated.

Footnote 2: Note that for the first partial, an octave-wide window includes only the partial itself, and the smoothed value therefore equals the observed amplitude of the partial.
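A minimal sketch of this envelope estimation and of the deviation measure of (23) is given below. The exact window shape on the log-frequency axis and the normalization are assumptions; the paper only states that an octave-wide Hamming window is centred on each partial and a weighted average is taken.

```python
import numpy as np

def smooth_envelope(partial_mags):
    """Octave-wide, Hamming-weighted moving average over harmonic partial magnitudes.

    partial_mags[h-1] is the magnitude of partial h. For each partial, the
    partials within half an octave on either side are averaged with a
    Hamming-shaped weight on the log-frequency axis.
    """
    mags = np.asarray(partial_mags, dtype=float)
    harmonics = np.arange(1, len(mags) + 1)
    env = np.zeros_like(mags)
    for h in harmonics:
        dist = np.log2(harmonics / h)                    # distance in octaves
        in_window = np.abs(dist) <= 0.5                  # octave-wide window
        weights = 0.54 + 0.46 * np.cos(2 * np.pi * dist[in_window])  # centred Hamming shape
        env[h - 1] = np.sum(weights * mags[in_window]) / np.sum(weights)
    return env

def positive_deviation(observed_features, envelope_features):
    """Eq. (23): deviation of the observed features from the features computed
    from the smoothed (squared) envelope, with negative values clipped to zero."""
    return np.maximum(observed_features - envelope_features, 0.0)
```

As described above, the envelope is computed on partial magnitudes; the smoothed values are then squared and passed through the same feature mapping as the observed partials before the difference is taken.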

Fig. 3. The spectra of an oboe sound (top) and a clarinet sound (bottom). The smoothed harmonic partial magnitudes have been highlighted and are connected with line segments to produce the smooth envelope of the sound.

Fig. 3 illustrates the smooth spectra for two example isolated sounds. As can be seen, there are negative deviations from the smooth envelope (especially for the clarinet, for which the even harmonics are weak due to its sound production mechanism), but none of the harmonics rises much above the envelope.

The smoothed magnitude spectrum values are then squared and substituted for the partial powers in (5) in order to get a smoothed feature vector x̂. We then calculate the difference

  Δ_b = x_b − x̂_b,   (23)

where negative values are clipped to zero, because we are interested only in positive deviations from the smooth spectral envelope.

For learning the mask probabilities, we utilize an oracle mask: an underlying ideal mask. The oracle mask is available at the training stage by generating training sound mixtures for which we have the isolated (clean) sounds before mixing. We compute the feature vectors from the mixture and, in addition, we compute the clean feature vectors by applying the same feature extraction procedure to the isolated sounds before mixing. The oracle mask is then defined as

  m_b = 1 if x_b − x̃_b < θ, and m_b = 0 otherwise,   (24)

where θ is an empirically found threshold value (in preliminary experiments, values of 3–6 dB were found suitable).

The top panel of Fig. 4 shows an example spectrum consisting of a polyphonic mixture of four instruments. The observed partial magnitudes of an oboe sound and the underlying clean magnitudes have been indicated. The ideal (oracle) mask for this sound is shown as a series of 1s and 0s above the spectrum. The subband boundaries are indicated with dotted horizontal lines (note that subbands 1–4 contain a single partial only). The lower panel gives a similar example for another mixture of four sounds, highlighting the partials of a clarinet sound and the corresponding ideal mask.

Based on the binary value of the oracle mask at each band, we compile two sets of histograms: for each subband b, one histogram of the Δ_b values counts only the training cases where the oracle mask at band b equals one, and another counts only the cases where it equals zero. Separate histograms are calculated for each subband. Based on the two histograms for each band, we can then calculate the empirical probability distributions

  P(m_b = 1 | Δ_b) = h_b^1(Δ_b) / (h_b^1(Δ_b) + h_b^0(Δ_b)),   (25)

where h_b^1 and h_b^0 denote the two histograms for band b.

Fig. 4. The upper panel shows the spectrum of a mixture of four sounds. The observed partial magnitudes of an oboe sound and the underlying clean magnitudes have been indicated. The ideal ("oracle") mask for the sound is indicated with 1s and 0s above the spectrum, and the subband boundaries are shown with dotted horizontal lines. The lower panel shows another example mixture, highlighting the partials of a clarinet sound and the corresponding ideal mask.

Fig. 5. Illustration of the empirically estimated mask probabilities. A darker value indicates a higher probability.

Fig. 5 illustrates the estimated mask probabilities as a function of the band index and Δ_b. As can be seen, the probabilities as a function of Δ_b are very informative for all the bands: as the value of Δ_b increases, the probability P(m_b = 1 | Δ_b) decreases. The estimated mask probabilities are common to all pitch values. In principle, the probabilities in (25) could be conditioned on the pitch, and separate mask probabilities could be estimated for different pitch values. However, this was found unnecessary in preliminary experiments.
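A sketch of how the oracle masks and the per-band reliability probabilities of (24)–(25) might be computed at the training stage is shown below. The dB-domain comparison, the bin handling, and the small additive smoothing of the histogram counts are illustrative assumptions rather than details given in the paper.

```python
import numpy as np

def oracle_mask(mixture_feats_db, clean_feats_db, theta_db=3.0):
    """Eq. (24): a band is reliable if the mixture feature exceeds the clean
    feature by less than theta_db (features assumed to be expressed in dB)."""
    return (mixture_feats_db - clean_feats_db < theta_db).astype(int)

def train_mask_probabilities(deltas, oracle_masks, bin_edges):
    """Eq. (25): per-band empirical probability that a band is reliable,
    as a function of the binned deviation delta_b.

    deltas, oracle_masks : arrays of shape (n_examples, n_bands)
    bin_edges            : 1-D array of histogram bin edges for delta
    Returns an array of shape (n_bands, n_bins) of P(m_b = 1 | delta_b).
    """
    n_bands = deltas.shape[1]
    probs = np.zeros((n_bands, len(bin_edges) - 1))
    for b in range(n_bands):
        h1, _ = np.histogram(deltas[oracle_masks[:, b] == 1, b], bins=bin_edges)
        h0, _ = np.histogram(deltas[oracle_masks[:, b] == 0, b], bins=bin_edges)
        probs[b] = (h1 + 1.0) / (h1 + h0 + 2.0)   # additive smoothing (assumption)
    return probs

def band_reliability(delta, probs, bin_edges):
    """Look up P(m_b = 1 | delta_b) for each band of a test observation."""
    idx = np.clip(np.digitize(delta, bin_edges) - 1, 0, probs.shape[1] - 1)
    return probs[np.arange(len(delta)), idx]
```

The per-band probabilities returned by band_reliability are the quantities combined under the independence assumption of (26), or thresholded to obtain the simplified masks discussed below.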
Finally, full mask probabilities are calculated by assuming independence of the different subbands:

  P(M | Δ) = Π_b P(m_b | Δ_b).   (26)

Footnote 3: Note that the value of Δ_b at the few lowest bands varies only within a narrow range, and is always zero for band 1, where the smoothed amplitude equals the observed amplitude.

In (26), M denotes a given mask for which the probability is to be calculated. The probabilities P(M | Δ) are then substituted for the mask probabilities in (21). Note that all the information required for mask estimation is included in Δ, which is obtained from (23), and in the empirical probability densities (25) estimated at the training stage.

The independence assumption in (26) is made for simplicity. The mask marginalization algorithm (to be described in the next subsection) would allow a more generic factorial form for the mask probabilities, given in (27), which leads to more accurate mask probabilities. However, for simplicity we stick to (26) in the following.

The estimated mask probabilities are used when marginalizing the mask, as will be described in the next subsection. Alternatively, we can also define a single maximum-likelihood mask estimate as

  M_ML = argmax_M P(M | Δ).   (28)

When using the maximum-likelihood mask, the probability of M_ML is set to one and the probabilities of all other masks to zero, making mask marginalization trivial, as summing over the masks is not needed in (21). Similarly to (28), we can define yet another, heuristic mask that does not require training the mask probabilities at all, but is directly based on the estimated value of Δ_b. This heuristically estimated mask is given by (29): a band is labeled clean if Δ_b stays below a threshold, where the threshold value was chosen based on preliminary experiments. Again, when using this mask, its probability is set to one and the probabilities of all other masks to zero, making the mask marginalization unnecessary. These simplifications of the proposed system will be separately evaluated in Section III.

F. Marginalizing the Mask

What remains to be done is to devise an algorithm that implements the summing over different masks in (21). The algorithm proposed in the following is closely related to the forward algorithm used to evaluate hidden Markov models [39]. The idea of our algorithm is to go through all possible masks by adding unity values (clean subbands) to the mask one at a time, proceeding towards higher subbands. For example, the lowest clean subband can be at any position between 1 and the highest band, and the second clean subband can be at any position above the first one. At each step of adding one clean band to the mask, we accumulate probabilities up to that subband. At the end, we obtain the probability P(c | x, p), which is the goal of all the computations.

In the following, we use partial probabilities in which we have accumulated the probabilities, up to a given subband, over all masks that contain a given number of clean bands (non-zero values) and have their two highest clean subbands at given positions. Since we cannot have a mask with only one clean band in our formulation, we start by adding two clean subbands at the first step. After adding the first two clean bands to the mask, the partial probability becomes (30), where a short-form notation is used for convenience. These values are computed for all valid positions of the two clean subbands and are zero for other combinations. One factor in (30) represents the two subbands selected as clean, and the remaining factors represent the subbands below them, where the mask is zero. The interpretation of (30) becomes clear by comparing it with (21).

Fig. 6. Illustration of the marginalization of the mask (accumulation of the probabilities), using example masks. The second step is repeated as many times as needed (see case b), and it can even be omitted (see case c, for masks with only three 1s). See text for details.

Fig. 6 uses a few example masks to illustrate the mask marginalization. Let us look at Fig. 6(a), for example. On the first step, using (30), two non-zero values have been added at the positions of the first two clean subbands. We need to calculate (30) for different positions of these two subbands to cover all possibilities. To write the equations further, we need a more specific notation: from (16) we define (31), where it is made explicit that the terms actually depend on three different subbands, not just two. To calculate the partial probabilities for masks with more clean bands, we always add one more non-zero value to the mask, above the previous ones (one more member in the set of clean subbands). The accumulated probability after setting the mask to one for a new subband is given by (32).

These values are computed for all valid combinations of the positions of the two highest clean subbands and are zero for other combinations (the lower of the two positions cannot be smaller than the number of clean subbands below it). A summation takes place in (32) in order to cover the different masks in which the third-highest clean subband can occur at different positions. In Fig. 6, the application of (32) is indicated as Step 2. Note that this step is repeated as many times as non-zero subbands are added (see Fig. 6(b) and (c)). Finally, when the last clean subband is added, we calculate the total probability over all masks that have the given number of clean subbands and have their two uppermost clean subbands at the given positions. The update formula is the same as (32), except that the product over the noisy bands extends to the highest subband instead of the newly added clean subband, in order to include the noisy subbands above the highest clean subband; this is given in (33). The difference between (32) and (33) stems from the fact that, in the former, one non-zero value at a time is added to the set of clean subbands, whereas in the latter, one non-zero value is added and then the rest of the mask is filled with zeros (Step 3). The end probabilities represent sums over all masks that contain a given number of non-zero values and have their two highest clean subbands at given positions. Finally, in order to obtain the probability in which we have accumulated the probabilities over all possible masks, we sum the end probabilities over all values of the number of clean subbands and the positions of the two highest clean subbands. This is given by (34). The above equation implements the marginalization over the masks in (21).

III. RESULTS

The proposed method and various baseline methods were tested using mixtures of musical instrument sounds. Recognition was performed within an individual 93-ms analysis frame, as our purpose was to reliably validate the proposed missing feature algorithms rather than to maximize the absolute recognition performance by including temporal features and integrating the class probabilities over time.

Two different databases were used, one for training the class models and another for testing, in order to ensure complete independence of the training and test data. For training the models, we employed the RWC Musical Instrument Sound database [40]. It contains several instances of each instrument (for example, several violins) and different dynamic levels and playing styles (depending on the instrument) for each instrument instance. For testing, we used the McGill University Master Samples (MUMS) database [41], which contains the full pitch range of one instance of each instrument played at one dynamic level and in the normal playing style. Ten different instruments were chosen for the classification task, mainly based on the availability of data in the above-mentioned databases. These included bassoon, cello, clarinet, flute, oboe, piano, piccolo, alto saxophone, tuba, and violin. The choice of the instruments was made before computing any classification results (with the exception of alto saxophone, which was chosen to replace trumpet because poor results for the baseline methods indicated that the trumpet types in the two databases were different).

The probability densities for each class were trained using isolated samples from the chosen instruments, as described at the end of Section II-D. We used all the available training data (all instrument instances, playing styles, and dynamic levels) by sampling randomly from the database until we had collected a fixed number of analysis frames for each instrument.
The analysis frames were 93 ms in duration and were chosen near the beginning (onset) of each randomly selected note. The test mixtures were obtained by choosing random notes from random instruments in the test database and mixing them at equal mean-square levels. One-, two-, and four-sound mixtures were generated for testing, and the recognition was carried out within a single 93-ms analysis frame near the onset of each mixture. Importantly, we had to constrain the test mixtures so that each instrument appears only once in a given mixture, excluding multiple notes from the same instrument in the same mixture signal. This information, along with the number of component sounds in the mixture ("polyphony"), was given as side-information to the classifiers. This was an unavoidable step due to the way classification is performed by the baseline methods, where the classifier simply outputs the most probable classes, as many as there are sounds in the mixture. This has the consequence that the random guess rate for isolated sounds is only 10% (one out of 10 instruments), whereas the random guess rate for a mixture of four sounds is 40% (guessing 4 out of 10 instruments). This makes the absolute recognition rates slightly less meaningful, but does not prevent comparing the proposed methods against the baselines and assessing the relative improvement achieved. Pitch information was not given as side-information; instead, the multipitch estimator [26] was applied as described at the beginning of Section II.

A. Reference Methods

In order to put the results in perspective, we compared the proposed method with various reference classifiers and acoustic features. As reference features, we employed Mel-frequency cepstral coefficients (MFCCs), which have been widely used for musical instrument recognition [9], speech recognition [42], and speaker verification [43]. The zeroth coefficient was discarded and the following 12 coefficients were used for classification. The MFCC features were mean- and variance-normalized so that each element of the feature vector has zero mean and unit variance over all the training data from all classes. The global mean and variance parameters were stored and used for normalizing the test features too. This improved the classification performance, albeit not significantly.

To study how much information is lost in the proposed method by retaining only the harmonic partials of each sound instead of the entire spectrum (see Section II-A), we computed another set of features, here called MFCC-H, based on the harmonic partials only. More exactly, we computed the discrete Fourier transform (DFT) of the analysis frame, applying a Hamming window and a zero-padding factor of two, picked local maxima in the vicinity of the frequency bins at integer multiples of the estimated pitch, and then set the rest of the spectrum to zero, retaining only individual frequency bins at the positions of the partials. The resulting spectrum was then used for calculating the MFCC coefficients in the normal manner. The number of coefficients used and the normalization applied were the same as for the standard MFCC features described above.

As a reference classifier, we employed a Bayesian classifier using Gaussian mixture models (GMMs) to represent the class-conditional likelihood densities. We tried different GMM model orders and settled for 10 Gaussians per model, with diagonal covariance matrices. Both MFCC and MFCC-H features were used in conjunction with this classifier. As another baseline classifier, we replaced the GMMs with simple Gaussian models that were trained separately for each pitch value (musical note). This model is termed the Gaussian-per-pitch model in the following, to indicate that there is one Gaussian per pitch. In order to have a sufficient amount of training data for each pitch value, we included data from notes around the target note, since the spectra of nearby pitches are usually representative too. In preliminary experiments, this was found to perform well. At the test stage, the pitches of the component sounds are estimated (similarly to the proposed method) and then used to choose the corresponding pitch-specific model from each class to calculate the class membership probabilities. The classifier with pitch-specific models was tested using MFCC, MFCC-H, and the features proposed in Section II-A. Diagonal covariance matrices were used in all cases.

B. Parameters of the Proposed Method

Acoustic features for the proposed method were extracted as described in Section II-A, using only the lowest 30 harmonic partials. The partials were located in the spectrum by searching for maxima within a small number of frequency bins around the integer multiples of the estimated pitch (similarly to MFCC-H described above). We analyzed the spectrum only up to 10 kHz and set the harmonic amplitudes to zero above that, since most instruments do not exhibit clear sinusoidal components above that limit. When mapping from linear to logarithmic frequency resolution, we used a third-octave resolution in (5). Finally, as described in Section II-D, in order to achieve scale invariance we used level differences between neighboring subbands as classification features, termed subband level differences (SLD) in the remainder of the paper. The trained models for the proposed algorithm were always pitch-specific. Similarly to the second baseline system, we utilized notes around each target note for training.

C. Results

TABLE I. RECOGNITION ACCURACY (%) OF DIFFERENT METHODS.

Table I shows results for the different configurations of the proposed method and the reference methods.
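For reference, the sketch below shows one way the MFCC+GMM baseline described in Section III-A above could be assembled. It uses librosa for the MFCCs and scikit-learn for the diagonal-covariance GMMs purely as stand-ins; the exact feature extraction, the normalization details, and the function names are assumptions and not the authors' implementation.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def frame_mfcc(y, sr):
    """12 MFCCs for one short analysis frame (zeroth coefficient discarded)."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return mfcc[1:, :].mean(axis=1)

def train_baseline(features_per_class, n_components=10):
    """Train one diagonal-covariance GMM per instrument class on
    globally mean/variance-normalized features."""
    all_feats = np.vstack(list(features_per_class.values()))
    mean, std = all_feats.mean(axis=0), all_feats.std(axis=0)
    models = {}
    for label, feats in features_per_class.items():
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
        gmm.fit((feats - mean) / std)
        models[label] = gmm
    return models, mean, std

def classify(feature_vector, models, mean, std, n_outputs=1):
    """Return the n_outputs most probable classes for one analysis frame
    (n_outputs would equal the known polyphony of the mixture)."""
    x = ((feature_vector - mean) / std).reshape(1, -1)
    scores = {label: gmm.score(x) for label, gmm in models.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n_outputs]
```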
The presented accuracies represent the average performance over 5 runs, each with different instantiations of the training and test sets. The first row of results in Table I shows the random guess rate, which increases along with the polyphony for the reason explained above. The main observation in Table I concerns the performance difference between the baseline method (shown in bold on the second row) and the proposed method on the last row. The proposed method outperforms the baseline by a wide margin for polyphonies 2 and 4, indicating that the missing feature approach provides a significant robustness improvement for mixture signals. The performance difference remains when switching to the MFCC-H baseline, indicating that utilizing the pitch information to pick the corresponding partials from the spectrum does not alone explain the performance difference. For isolated samples, however, the proposed method performs clearly worse than the baseline. The main reason seems to be that the proposed features are based on the amplitudes of the harmonic partials only, discarding the spectrum between the partials, and are subject to pitch estimation errors. This conclusion can be drawn by comparing the 2nd and 3rd rows in Table I, where the performance of the GMM baseline drops from 74.6% to 62.3% for isolated samples when moving from MFCC to MFCC-H features. Utilizing the stochastic spectral components between the harmonics in polyphonic music is difficult due to the low level of these components. (However, see [19], where the authors used both the harmonic part and the attack transient for polyphonic instrument recognition.) Of course, in practice the recognition would not be based on a single analysis frame or even a single note, and therefore the recognition rates for solo recordings could be considerably improved even with the harmonic part only.

Rows 4–6 of Table I show results for the second baseline classifier with pitch-specific Gaussian models (Gaussian-per-pitch). Interestingly, this classifier performs essentially as well as the GMM-based classifier, suggesting that the use of a simple Gaussian per note (as is done in the proposed method) is not problematic. Both the GMM and the Gaussian-per-pitch methods were trained on polyphonic material, which was found to perform significantly better than the same models trained on isolated samples. For example, the performance was on average 15 and 9 percentage points worse for polyphonies of 2 and 4, respectively, when the models were trained on isolated samples. The performance difference for the different features on rows 4–6 is quite illuminating. The difference between the MFCC and MFCC-H features only confirms what was already discussed above for the GMM-based classifier. However, it is interesting that the MFCC-H and the SLD features perform nearly equally well, despite the fact that the proposed features are correlated and diagonal covariance matrices were used. We tested decorrelating the SLD features using principal component analysis (PCA, [44]) before employing them in the Gaussian-per-pitch classifier, and separately the use of full covariance matrices, but the results were very similar. It is important to note that the proposed features given to the Gaussian-per-pitch classifier consisted of level differences between successive subbands, resulting in a feature vector that is one element shorter than the vector described in Section II-A. This partly decorrelates the features.

Rows 7–11 of Table I show results for different configurations of the proposed method. Five different types of masks were tested. In addition to the proposed mask marginalization explained in Section II-F ("mask probabilities" on row 12), we employed the maximum-likelihood mask (row 11) and the heuristic mask (row 10) described at the end of Section II-E. As a simple baseline mask, we used an all-one mask and set its probability to one (row 9). Furthermore, in order to obtain an upper bound on the performance from the viewpoint of mask estimation, we calculated the ideal oracle mask by utilizing the test samples before mixing and set its probability to one (row 7). It is worth mentioning that by passing the ground-truth F0s to the method at the testing stage, thus suppressing any error propagation due to incorrect F0 estimation, the oracle mask performance improves by a further 2.5 and 2 percentage points for polyphonies of 2 and 4, respectively. The performance difference between the oracle and the all-one mask is quite drastic for polyphonies 2 and 4, highlighting the importance of handling unreliable data appropriately in the classification process. Results for the estimated mask probabilities (last row of Table I) are approximately half-way between the oracle mask and the all-one mask, indicating that the spectral-smoothness-based mask estimation is able to make an important step towards the ideal mask. Interestingly, this is true even in the case of isolated samples, for which the oracle and the all-one mask are equivalent: here the results are identical for the oracle, the all-one, and the estimated masks. The maximum-likelihood mask (row 11) and even the heuristic mask (row 10) perform surprisingly well, although slightly worse than the full system on row 12. This indicates that the mask marginalization algorithm described in Section II-F could be avoided in computationally restricted applications.

To study the difference between bounded integration and full marginalization, we calculated results for the oracle mask using both options (rows 7 and 8 of Table I). In the case of full marginalization, the integral over the unobserved clean value is calculated over the entire real axis instead of only up to the observed value, which has the consequence that the noise terms become one and need not be computed at all.
Confirming the results of other authors [24], bounded marginalization works consistently (although not dramatically) better than full marginalization: assuming that the observed unreliable value gives an upper bound for the unobserved clean value is meaningful.

TABLE II. RECOGNITION ACCURACY (%) PER INSTRUMENT OF DIFFERENT METHODS IN A POLYPHONY OF 4.

Table II presents results per instrument for the polyphony of 4 and for a subset of the methods. As can be seen, the differences between instruments are rather large. This is partly explained by the facts that 1) some of the instruments are more easily confused with each other, and/or 2) for some instruments the training data is not representative of the test data due to differences between the two databases. Interestingly, it seems that the proposed method performs worse for high-pitched instruments such as the alto saxophone, piccolo, and flute. One possible explanation is that the number of partials available below 10 kHz is not sufficient for robust recognition, especially if some of those partials are partly obscured by other overlapping sounds.

Fig. 7. Confusion matrix for the proposed method. The rows correspond to the present instrument and the columns to the recognized instrument. Each entry of the matrix shows the mean accuracy and standard deviation over 5 runs for a polyphony of 1, to highlight the confusions among instruments.

Fig. 7 shows a confusion matrix for the proposed method and a polyphony of 1. The results on the diagonal further support the observation that the proposed method works somewhat worse for higher-pitched instruments.

12 1816 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 21, NO. 9, SEPTEMBER 2013 IV. CONCLUSIONS A novel method was proposed for musical instrument identication in polyphonic music signals. The method is based on the missing feature approach and local spectral features, using bounded marginalization to treat the unreliable feature vector elements. A mask estimation technique was proposed that is based on the assumption that the spectral envelopes of musical sounds tend to be slowly-varying as a function of log-frequency. A computationally efficient algorithm for marginalizing the mask was described. In mixture signals, the proposed method outperformed the reference methods by a wide margin, indicating that the missing feature approach provides a signicant robustness improvement when processing polyphonic audio. For isolated samples, the proposed method performed somewhat worse than the reference method, which seems to be due to the fact that information only at the positions of the harmonic partials is utilized and the rest of the spectrum is discarded. Dferent masks were tested with the proposed method and their performance was compared with that of a trivial all-one mask and an ideal oracle mask. The estimated masks achieve an accuracy that is approximately half-way between that of the all-one and the oracle masks, indicating that the proposed mask estimation principle is efficient. The estimated maximum-likelihood mask performs surprisingly well too, making it a viable option to avoid the mask marginalization step altogether in computationally limited applications. Future work will include augmenting the model with temporal features, integrating frame-wise class probabilities over time, and extending the mask estimation process to include additional information. APPENDIX denote a multivariate normal random vari- and covariance matrix Let able with mean The conditional distribution of given is normally distributed with mean and variance (35) (36) (37) REFERENCES [1] A. Bregman, Auditory Scene Analysis. Cambridge, MA, USA: MIT Press, [2] M. Cooke, P. Green, and M. Crawford, Handling missing data in speech recognition, in Proc.3rdInt.Conf.SpokenLang.Process., 1994, pp [3] N. Fletcher and T. Rossing, The Physics of Musical Instruments. Berlin, Germany: Springer-Verlag, [4] A. Klapuri and M. Davy, Signal Processing Methods for Music Transcription. New York, NY, USA: Springer-Verlag, [5] E. Benetos, S. Dixon, D. Giannoulis, H. Kirchhoff, and A. Klapuri, Automatic music transcription: Breaking the glass ceiling, in Proc. 13th Int. Soc. for Music Inf. Retrieval Conf., Oct [6] M. Müller, D. Ellis, A. Klapuri, and G. Richard, Signal processing for music analysis, IEEE J. Sel. Topics Signal Process., vol. 5, no. 6, pp , Oct [7] C.Joder,S.Essid,andG.Richard, Temporal integration for audio classication with application to musical instrument classication, IEEE Trans. Audio, Speech, Lang. Process., vol. 17, no. 1, pp , Jan [8] T. Kitahara, M. Goto, K. Komatani, T. Ogata, and H. G. Okuno, Instrument identication in polyphonic music: Feature weighting to minimize influence of sound overlaps, EURASIP J. Appl. Signal Process., vol. 2007, pp. 1 15, [9] P. Herrera-Boyer, A. Klapuri, and M. Davy, Automatic classication of pitched musical instrument sounds, in Signal Processing Methods for Music Transcription, A.KlapuriandM.Davy,Eds. NewYork, NY, USA: Springer, 2006, pp [10] D. Little and B. Pardo, Learning musical instruments from mixtures of audio with weak labels, in Proc. 9th Int. Symp. 
Dimitrios Giannoulis (S'11) received the B.Sc. degree in physics from the National University of Athens, Greece, where he specialized in, among other areas, signal processing and acoustics. He received the M.Sc. degree in digital music processing from Queen Mary University of London, London, U.K., in 2010. He is currently pursuing the Ph.D. degree in electronic engineering at the Centre for Digital Music (C4DM), Queen Mary University of London. His main research interests are machine learning, audio signal processing, computational auditory scene analysis, and music information retrieval.

Anssi Klapuri (M'06) received the Ph.D. degree from Tampere University of Technology (TUT), Tampere, Finland. He was a visiting post-doctoral researcher at Ecole Centrale de Lille, France, and at Cambridge University, U.K., in 2005 and 2006, respectively. He worked as a Lecturer at the Centre for Digital Music, Queen Mary University of London, London, U.K. He is currently CTO at Ovelin, Finland, and Associate Professor at TUT. His research interests include audio signal processing, auditory modeling, and machine learning.
