Music Source Separation


Hao-Wei Tseng
Electrical Engineering: Systems, University of Michigan, Ann Arbor, Michigan
Email: blakesen@umich.edu

Abstract—In popular music, a cover version, cover song, or simply cover is a new performance or recording of a previously recorded song by someone other than the original artist. For most people, however, it is practically impossible to obtain any single track of a commercial recording. My goal is therefore to deliver a program that separates a recording into several tracks, each corresponding to a meaningful source, which cover artists can use to facilitate their performances.

I. INTRODUCTION

Cover artists on YouTube have recently become increasingly popular. In order to make cover music, however, these artists have to acquire partial recordings. For example, a cover singer sings over an off-vocal version of a song, and an accompaniment artist plays along with a version of the original performance from which a particular instrument has been removed. Some off-vocal tracks are released with the albums, which makes them easy to acquire, but in most cases popular songs are not released with an off-vocal version, and tracks performed without certain instruments are rarely found on the public market; they are available only in special cases. As a result, most cover artists have to come up with their own solutions. One way is to re-create every track of a piece of music, but this requires fundamental training in music, which is inaccessible to the general public. I therefore provide a program that is able to separate the vocal and off-vocal tracks.

II. BACKGROUND & DIFFICULTIES

In my experiment I focus on a solo singer backed by multiple instruments; that is, I assume each music piece has at most one vocal component together with zero or more off-vocal components.

Fig. 1. Time domain of a music piece.

Fig. 2. Frequency domain of a music piece.

Looking at the figures, the time-domain plot (Fig. 1) looks like noise and is of little use for my experiment, so I analyze the signal in the frequency domain (Fig. 2) instead. The signals appear to be mixed together around the center of the spectrum, and it is hard to tell the vocal from the off-vocal part by analyzing the spectrum directly. Worst of all, the frequency content of the vocal and off-vocal parts necessarily overlaps, which makes it impossible to separate the two with a simple filter. Fortunately, I can use machine learning techniques for this problem, namely Blind Source Separation (BSS).

BSS is a useful and powerful technique for this kind of problem: it separates different sources from a set of mixtures without prior knowledge of the sources or of the way they are mixed. With this advantage, BSS is one of the most suitable approaches for my problem. I include Independent Component Analysis (ICA) and the Degenerate Unmixing Estimation Technique (DUET) in this project.

A. ICA

ICA finds the independent components by maximizing the statistical independence of the estimated components. As a result, ICA is one of the most popular BSS methods and is known for its ability to separate mixtures of speech signals by blindly tracking the underlying components. I applied the FastICA toolbox provided by [1]. However, the number of output sources is limited by this approach, as the formula below shows:

    ŝ = W x ≈ A⁻¹ x,

where x is the vector of observed mixtures, A is the unknown mixing matrix, W is the estimated unmixing matrix, and ŝ is the vector of estimated sources.
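As a rough illustration of how ICA can be applied to a stereo recording (the project itself used the MATLAB FastICA toolbox cited above, so this is only a hedged sketch), the snippet below applies scikit-learn's FastICA to the left and right channels of a WAV file; the filename and parameter choices are assumptions, not the paper's actual setup.

```python
# Hedged sketch: two-channel ICA separation with scikit-learn's FastICA.
# "song.wav" and the parameter values are placeholders, not the project's settings.
import numpy as np
from scipy.io import wavfile
from sklearn.decomposition import FastICA

rate, stereo = wavfile.read("song.wav")          # assumes a stereo file, shape (n_samples, 2)
x = stereo.astype(np.float64)

# Each stereo channel is one observed mixture, so ICA can estimate at most two sources.
ica = FastICA(n_components=2, random_state=0)
s_hat = ica.fit_transform(x)                     # estimated sources, shape (n_samples, 2)

# Normalize each estimate to the int16 range and write it back out.
for i in range(s_hat.shape[1]):
    src = s_hat[:, i] / (np.max(np.abs(s_hat[:, i])) + 1e-12)
    wavfile.write(f"estimated_source_{i}.wav", rate, (src * 32767).astype(np.int16))
```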

ICA needs at least as many observations as independent components, but the algorithm is still very good at separating the vocal and off-vocal parts. According to the formula, ICA can separate out no more estimated independent sources than the number of observations I provide, which here are the left and right channels. As a result, the two outputs correspond roughly to the vocal and the off-vocal part, although they still contain some noise.

B. DUET

DUET separates degenerate mixtures by partitioning the time-frequency representation of one of the mixtures. In other words, DUET assumes the sources are already separated in the time-frequency plane, i.e., that the sources are disjoint; the demixing process is then simply a partitioning of the time-frequency plane. Although the assumption of disjointness may seem unreasonable for simultaneous speech, it is approximately true, meaning that the time-frequency points which contribute significantly to the average energy of the mixture are very likely to be dominated by a single source. Stated another way, two people rarely excite the same frequency at the same time. Under this assumption, I can separate the sources into several pieces.

A blind source separation problem is considered degenerate when the number of observations is smaller than the number of actual sources, so DUET can be used to pull more components out of the mixtures. Traditional separation techniques such as ICA cannot solve such problems, but DUET can blindly separate an arbitrary number of sources given just two anechoic (echo-free) mixtures, provided the time-frequency representations of the sources do not overlap too much [3]. With this advantage, DUET is able to separate more components, and with better quality.

For some sources, the implementation of [2] gives a good result, as in Fig. 4, where the inputs are two recordings of speech and the speech components are estimated almost perfectly; the speech can safely be assumed to be well approximated by anechoic mixtures. When the sources are mixtures of instruments, or of instruments and vocals, however, the output is far less usable (Figs. 5 and 6). These two figures show that DUET performs worse when the sources are mixtures of vocal and off-vocal tracks. For the purely off-vocal excerpt (Fig. 6), there are two clearly less-mixed components in the plot (marked by the data cursors), which correspond exactly to two drum sources, as verified by listening. For Fig. 5, the plot contains several different pulses, which exposes the drawbacks of DUET.

C. CQT

The Constant-Q Transform (CQT) follows a similar idea to the Fourier transform, but it uses a logarithmic frequency scale [4]. The CQT is defined as

    X[k] = (1 / N[k]) · Σ_{n=0}^{N[k]-1} W[k, n] · x[n] · e^(−j·2π·Q·n / N[k]),

where x[n] is the time-domain signal and X[k] is the k-th frequency-domain coefficient. W is a window function used to reduce aliasing effects near the maximum frequency; it also isolates the signal to a short time period. The parameters are defined as follows:

    N[k] = f_s / δf_k = Q · f_s / f_k,    f_k = (2^(1/b))^k · f_min,    Q = f_k / δf_k = 1 / (2^(1/b) − 1),

where f_s is the sample rate, f_min is the minimum frequency, b is the number of bins per octave, and f_k is the center frequency of the k-th coefficient. Just as the Discrete Fourier Transform (DFT) can be viewed as a bank of linearly spaced filters, the CQT can be viewed as a bank of exponentially spaced filters. In contrast to the linear resolution of the Fourier transform, the CQT has logarithmic resolution in the frequency domain.
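To make these parameter relationships concrete, here is a small self-contained sketch (my own illustration, not code from the paper) that computes Q, the center frequencies f_k, and the window lengths N[k] from f_min, b, and f_s exactly as defined above; the numeric defaults are arbitrary.

```python
# Hedged sketch of the CQT parameter relationships defined above.
# f_min, f_max, bins_per_octave, and sample_rate are illustrative choices.
import numpy as np

def cqt_params(f_min=55.0, f_max=7040.0, bins_per_octave=12, sample_rate=44100):
    Q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)         # Q = f_k / delta_f_k
    n_bins = int(np.ceil(bins_per_octave * np.log2(f_max / f_min)))
    k = np.arange(n_bins)
    f_k = f_min * 2.0 ** (k / bins_per_octave)                # exponentially spaced centers
    N_k = np.ceil(Q * sample_rate / f_k).astype(int)          # window length per bin
    return Q, f_k, N_k

Q, f_k, N_k = cqt_params()
print(f"Q = {Q:.2f}, lowest bin: f = {f_k[0]:.1f} Hz, N = {N_k[0]} samples")
print(f"highest bin: f = {f_k[-1]:.1f} Hz, N = {N_k[-1]} samples")
```

Note how N[k] shrinks as f_k rises, which is exactly the constant-Q trade-off: long windows (fine frequency resolution) for low notes, short windows (fine time resolution) for high notes.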
Since musical notes are spaced exponentially across each octave, the CQT maps the musical scale linearly. This gives me an alternative way to map musical signals onto the time-frequency domain so that the instruments do not overlap too much. I implement an iterative CQT-based source separation algorithm to identify each instrument in an excerpt; Fig. 3 shows the system diagram of the algorithm.

Fig. 3. System diagram of the proposed algorithm.

First, the input signal is transformed into the time-frequency domain by a short-time CQT, which yields a spectrum of the original signal. The lowest harmonic within each time slot is then traced on the spectrum. Once I have located the lowest harmonic, I expand and isolate the spectrum around it; this step is called trace expansion. I then cluster the power spectrum of the trace, and the lowest-frequency cluster is extracted as the first instrument, using a bandpass filter for the extraction. After removing the signal of the first instrument from the observation, I repeat the whole procedure until no more instruments can be extracted.

III. RESULT

A. ICA

I chose several song excerpts of different genres as input, each a short clip. The two channels are passed as the observations to the FastICA toolbox, and the output contains two separated signals. By listening, one source can be identified as the off-vocal version [5] of the original excerpt [6]: most of the vocal part is removed. The other source contains the vocal part and some of the accompaniment. There is little distortion in either separated signal.

B. DUET

Fig. 4 shows the time-frequency representation of a recording of four people speaking concurrently [7]. Four disjoint peaks are easy to identify in the histogram, and, as expected, the corresponding reconstructions [8] of the four sources are clearly intelligible.

Fig. 5 shows the time-frequency representation of a pop music excerpt [6]. The histogram is more spread out than that of a speech signal; in other words, the sources are less disjoint in this representation.

The result is a poorer-quality reconstruction. I choose the four largest peaks as the centers of the masks. The reconstructed signals contain a lot of distortion noise and are hardly identifiable by ear; only the signal filtered from the main peak contains a recognizable voice.

Fig. 6 shows the time-frequency representation of an electronic music excerpt [9] in which no voice is present. There are two peaks and hence two sources. The reconstructed signals correspond to the two drum sources [10]; the sound of the remaining instruments is still highly distorted in the separated signals.

While DUET separates the speech of different people successfully, it performs poorly at separating the vocal signal from the accompaniment and at separating different instruments. DUET relies on the sources being disjoint in the time-frequency domain, which is generally true for speech but not for musical performances. In speech, only the vowels contain concentrated power; consonants are close to white Gaussian noise, with no significant structure in the frequency domain. Furthermore, vowels do not appear continually, resulting in a highly disjoint time-frequency representation. Musical instruments, on the other hand, are often played continually, and their frequency content is much more complicated. Pitched instruments are typically based on an approximately harmonic oscillator, such as a string or a column of air, which oscillates at numerous frequencies simultaneously. The signal power is spread across the octaves, giving widely spread spectra that overlap one another in the time-frequency representation. This also explains why the drums are separable by DUET: they are not pitched, and percussion instruments are not played continually, so they resemble speech signals in the time-frequency representation.

Fig. 4. Estimated independent components of speech by DUET.

Fig. 5. Estimated independent components of pop music by DUET.

Fig. 6. Estimated independent components of electronic instruments by DUET.

C. CQT

Fig. 7 shows the CQT time-frequency representation of a classical music excerpt [11]. More than three major instruments can be recognized, plus additional harmonic waves. My method is to filter out each main instrument by tracing its energy, and then to use k-means to cluster the result and select the desired cluster (a rough sketch of this clustering step is given below). For example, the mask of the first estimate is shown in Fig. 8 and the corresponding result in Fig. 9.

Fig. 7. Spectrum of the original classical music excerpt by CQT.

Fig. 8. First mask.

Fig. 9. First estimated instrument.
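As a rough, hypothetical illustration of the clustering step above (the paper does not spell out its implementation), the sketch below uses scikit-learn's KMeans to group the bins of a traced power spectrum by log-frequency and log-power and keeps the lowest-frequency cluster; the feature design and the choice of three clusters are assumptions.

```python
# Hedged sketch: cluster the CQT bins of a traced spectrum with k-means and keep
# the lowest-frequency cluster as the first instrument's support.
# The feature design and n_clusters=3 are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

def lowest_cluster_mask(trace_power, f_k, n_clusters=3):
    """trace_power: (n_bins,) average power of the traced region per CQT bin.
       f_k: (n_bins,) CQT center frequencies. Returns a boolean mask over bins."""
    active = trace_power > 0
    feats = np.column_stack([np.log2(f_k[active]),                     # log-frequency
                             np.log10(trace_power[active] + 1e-12)])   # log-power
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(feats)
    # Pick the cluster whose mean log-frequency is lowest.
    lowest = min(range(n_clusters), key=lambda c: feats[labels == c, 0].mean())
    mask = np.zeros(trace_power.shape, dtype=bool)
    mask[np.flatnonzero(active)[labels == lowest]] = True
    return mask
```

The returned mask would then drive a bandpass-style extraction of the first instrument, after which the procedure repeats on the residual, in the spirit of the iterative algorithm of Fig. 3.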

Fig. 10 shows the time-frequency representation of an estimated double bass [12]. It is recognizably the second trace of the original plot. There are some harmonic waves, which are the overtones produced by the double bass. By listening, it is a clean double bass sound without distortion.

Fig. 10. Spectrum of a separated instrument: double bass.

Fig. 11 shows the time-frequency representation of an estimated flute [13]. It is recognizably the top trace of the original plot. There are again some harmonic waves, overtones both from the flute itself and from the other instruments. By listening, it is a flute sound with a little distortion, which might be caused by the distorted overtones of the other instruments.

Fig. 11. Spectrum of a separated instrument: flute.

I also evaluated the separation algorithms by comparing the signal-to-noise ratio (SNR) of the reconstructed sources for each technique. Since the original tracks of each instrument in a commercial release are unavailable, I generated a short piece of music [14], a mixture of two test recordings, for this experiment. Table I shows the SNR of the reconstructed signals, where the noise is defined as the distortion of the reconstructed source. My algorithm is about 3 dB better than DUET alone, showing that the CQT can better capture the features of musical instruments.

TABLE I. SNR COMPARISON OF DUET AND THE PROPOSED ALGORITHM

                        SNR of source 1    SNR of source 2
Proposed algorithm      5.46 dB            4.65 dB
DUET                    .9 dB              -5.3 dB

IV. CONCLUSION

ICA separates the vocal and the accompaniment successfully. However, ICA requires at least as many observations as sources. In my case the observations are the two channels, left and right, of the track, which limits the output to two sources. If I wish to separate more sources, for example different instruments within the accompaniment, I need to exploit more features of the given recording.

The DUET algorithm can separate an arbitrary number of sources given two anechoic observations [2]. However, it assumes that the sources are distinguishable in the time-frequency domain obtained by the Fourier transform. This is true for speech signals, where the signal power is concentrated where vowels appear, since consonants act like white Gaussian noise. For musical instruments, however, the signal power is spread across the octaves, which makes the sources hard to distinguish from one another.

The CQT is another mapping from the time domain to the frequency domain. Unlike the Fourier transform, the CQT uses logarithmic spacing in the frequency domain, giving a linear representation of musical notes. This allows me to separate different musical instruments: I implemented an iterative method to isolate each source in the time-frequency domain generated by the CQT. My algorithm can separate different musical instruments from a given mixture, and it improves the SNR of the estimated sources by about 3 dB compared to the original DUET.
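The SNR comparison reported above can be reproduced in spirit with a helper like the one below; this is a hedged sketch of one common definition (noise taken as the difference between a reference track and its gain-matched reconstruction), not necessarily the exact computation used for Table I.

```python
# Hedged sketch of an SNR evaluation in the spirit of Table I: the "noise" is the
# difference between a known reference track and its reconstruction. The alignment
# and gain-matching steps are assumptions, not the paper's exact procedure.
import numpy as np

def snr_db(reference, estimate):
    """SNR (dB) of an estimated source against its reference, after a least-squares
       scalar gain fit so that arbitrary output scaling does not penalize the estimate."""
    n = min(len(reference), len(estimate))
    ref, est = reference[:n].astype(float), estimate[:n].astype(float)
    gain = np.dot(ref, est) / (np.dot(est, est) + 1e-12)   # best scalar fit
    noise = ref - gain * est
    return 10.0 * np.log10(np.dot(ref, ref) / (np.dot(noise, noise) + 1e-12))
```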

To sum up, the tools from class that I used in this project are: sampling, to convert data from continuous time to discrete time; the Fourier transform and the fast Fourier transform, to map my data into the frequency domain; and filter design with moving averages, to extract the estimated components. In addition, I used machine learning and related techniques such as ICA, DUET, k-means, and the CQT. I therefore gained a great deal of knowledge from this project, which gave me the opportunity to put the theory into practice.

V. FUTURE WORK

By the end of this project I had not yet succeeded in incorporating the CQT into DUET as shown in Fig. 12. Instead, I implemented several separation criteria of my own in the time-frequency domain to exploit the CQT for separating musical instruments. While I hand-tune the masking parameters, machine learning techniques could be applied to learn the optimal clustering parameters in the frequency domain. Such techniques could also be incorporated into the DUET algorithm to automate peak detection.

Fig. 12. System diagram of the proposed algorithm.

REFERENCES

[1] FastICA toolbox, http://research.ics.aalto.fi/ica/fastica/
[2] Scott Rickard, "The DUET Blind Source Separation Algorithm," in Blind Speech Separation, Springer Netherlands, 2007.
[3] Zafar Rafii and Bryan Pardo, "Degenerate Unmixing Estimation Technique Using the Constant Q Transform," 36th International Conference on Acoustics, Speech and Signal Processing (ICASSP), Prague, Czech Republic, May 2011.
[4] Benjamin Blankertz, "The Constant Q Transform," http://wwwmath.uni-muenster.de/logik/personen/blankertz/constq/constq.html
[5] The music source is at musics/pop offvocal.
[6] The music source is at musics/pop origin, found on YouTube.
[7] The music source is at musics/speech, found at http://eleceng.ucd.ie/~srickard/bss.html.
[8] The music source is at musics/speech estimate.
[9] The music source is at musics/elec origin, found on YouTube.
[10] The music source is at musics/elec estimate.
[11] The music source is at musics/instrus origin, from http://www.zafarrafii.com/research.html.
[12] The music source is at musics/instrus bass.
[13] The music source is at musics/instrus flute.
[14] The test mixture is the mixture of test input 1 and test input 2; the corresponding DUET outputs are test duet est 1 and test duet est 2, and the CQT outputs are test CQT est 1 and test CQT est 2.