Proposal for Application of Speech Techniques to Music Analysis


1. Research on Speech and Music

Lin Zhong, Dept. of Electronic Engineering, Tsinghua University

1. Goal

Speech research has, from the very beginning, aimed at benefiting the general public rather than experts. Apart from military applications, which have brought it a great deal of money, the two main goals of speech research have been 1) low bit-rate transmission of speech and 2) speech-based human-machine interaction. Speech coding/synthesis was developed for the former, and speech recognition technology for the latter. With the development of microelectronics and network technology, these two technologies appear ever more desirable and lucrative, which is another reason speech research attracts money from governments and industry.

Computer music, on the contrary, began as something of a personal hobby, and then aimed at benefiting musicians, especially composers and music students. Computer music research has received limited funding because it benefited a very limited community. As signal processing researchers and funds gather around the Internet, music researchers have noticed the tremendous need of ordinary people to access, retrieve, and index networked audio resources (Vercoe et al. 1998), and speech researchers have long noticed the same need (Rose 1995). Moreover, with the development of multi-modal human-machine interfaces (Sharma et al. 1998) and integrated media systems (McLeod et al. 1999), music, like speech and gesture, promises to become one mode through which people communicate with computers, just as we communicate with each other through music (this, I think, is where musical sound synthesis would be most attractive). In the Internet and the computer, research on speech and music can find a common goal, and I believe this is the right place for computer music research to seek funding. Many speech and music research topics have physically similar goals (see Table 1).
2. Method

Many methods for speech synthesis/coding are quite different from those for music synthesis, because the two have rather different goals. For transmission-oriented speech coding/synthesis, the foremost concern is the best quality at the least cost in computation and distortion; for music synthesis it is usually authenticity. For natural-mode human-machine interfaces, machines must speak to their users; here authenticity also becomes the foremost concern for speech synthesis, or more precisely text-to-speech synthesis, and many of the techniques employed are simply "transformed" versions of those used for music synthesis.

Three strategies have been employed to tackle speech recognition (Rabiner & Juang 1993): acoustic-phonetic, pattern recognition, and artificial intelligence. The most successful so far is the statistics-based pattern recognition strategy (although speech researchers are seeking ways to do away with the "devil" of statistics in favor of explicit modeling). Where spoken language understanding is concerned, AI is also incorporated. In addition, for speech, cepstral analysis is as widely employed as temporal or spectral analysis.

The most widely adopted strategy for musical sound analysis seems to combine artificial intelligence, psychoacoustics, music theory, and signal processing. Such a strategy is more deterministic than statistical; methods based on statistics or pattern classification have not yet been fully developed (Martin et al. 1998b, Brown 1999, and Raphael 1999), and temporal and spectral analyses are employed far more often than cepstral analysis.

This proposal is preliminary and incomplete, and will undergo revision.

However, music researchers should take pride in their insightful explorations of psychoacoustics and the human auditory system (Ellis 1992, Martin et al. 1998a, and Scheirer 1998); such knowledge has so far been little exploited for speech recognition. Music researchers should also be proud of the diversity of the techniques they have studied: while many speech analysis techniques have become "standard" in the speech community, there is no unanimously adopted technique for music analysis. What is more, since statistical acoustic and language modeling has long been the core of speech analysis, the biggest share of speech recognition research has gone into collecting speech databases and training statistical models; in this regard, music researchers have done much more for the signal processing of musical sound.

Table 1. Speech research topics and their music counterparts

- Speech synthesis (i. synthesis from coded speech, for transmission; ii. text-to-speech synthesis; iii. synthesis for a specific speaker) -- Musical sound synthesis (i. score-to-music synthesis; ii. synthesis for a specific instrument; see Roads' book)
- Speaker identification (GMM/cepstrum) -- Instrument (timbre) identification (GMM/correlogram, cepstrum)
- Language identification -- Music genre/type identification
- Speech recognition (explicit statistical and acoustical modeling) -- Music transcription (no explicit statistical and acoustical modeling yet)
- Pitch tracking (nearly the same as in music) -- Pitch tracking (nearly the same as in speech)
- Mandarin tone recognition (pitch tracking and temporal analysis, neural networks) -- Music tone analysis
- Cocktail-party effect / speaker tracking (1. microphone-array techniques; 2. voice separation and tracking) -- Musical voice tracking and separation

3. State of the Art

The success of speech coding/synthesis is evident in view of speech-based telecommunication (Juang, ed., 1998). Speech coding and compression research is now heading toward the theoretical limit and toward greater robustness.
In addition, speech researchers now talk about "spoken language systems" (Cole et al. 1995) instead of "automatic speech recognition"; the latter has become only a sub-area of the former. This change in terminology reflects how much progress speech recognition research has made in recent years. Various commercial automatic speech recognizers are available on PCs and single chips (see, for example, the homepage of Sensory, Inc.).

Music synthesis for economical representation and transmission has been equally successful in many respects; the standardized MIDI format is an excellent example. However, owing to the inherent complexity of music and musical instrument acoustics, computer-synthesized sound for instruments such as the violin remains far from satisfying in terms of quality and required computation. Furthermore, unlike speech synthesis, which in most cases need only be intelligible, musical instrument sound synthesis must be enjoyable; hence the big challenge for music synthesis.

As for music analysis, "Today's computer systems are not capable of understanding music at the level of an average five-year-old; they cannot recognize a melody in a polyphonic recording or understand a song on a children's television program" (Martin et al. 1998a). I believe there are many more interesting things to do with musical sound.

2. Speech Techniques and Their Prospects in Music

Since my research experience so far centers on speech recognition, I will stay within that domain here. There are three technologies generally boasted of by speech recognition researchers; I would like to discuss them among many

other techniques.

1. Hidden Markov Models (HMMs)

HMMs are highly successful at modeling the speech signal statistically and dynamically. HMM training is straightforward to implement and mathematically sound, and decent generalization from training samples is usually observed. The embedded Viterbi dynamic programming can warp and align temporal variation in speech. Moreover, an HMM, being both a finite automaton and a probabilistic model, can easily incorporate rules (lexical, syntactic, semantic, etc.) and other probabilistic models (neural networks, linear regression, etc.); a system based on HMMs is ready to be both knowledge-driven and data-driven. These features are desirable for musical sound too.

However, HMMs have many limitations. The two most serious are:

1. To ensure both robustness and accuracy, there is never enough speech data to train the system. If HMMs are applied to modeling musical sound, this problem will be even more serious, because musical sound databases are only just emerging. Speech researchers, while cherishing statistics, are seeking to avoid it: research has been conducted to model many phenomena explicitly instead of statistically, including speaker variation, background noise, and channel distortion. Doing so reduces the required training data tremendously.

2. Distribution estimation with HMMs is parametric and non-discriminative. The popular maximum likelihood estimation (MLE) cannot ensure minimum classification error (MCE). One recent research focus is training techniques that secure the most accuracy and robustness; techniques such as maximum a posteriori (MAP) training, maximum mutual information (MMI) training, and generalized probabilistic descent / minimum classification error (GPD/MCE) training have been advertised.

I think HMMs are the most promising technique to be readily applied to music analysis. Some of the many possibilities are:

1. Statistical music object modeling.
Many music objects can be identified by people only over many notes or measures; they are identified as a dynamic time series. The methods employed so far are poor at dealing with such time dynamics and at integrating long segments, and knowledge-based identification systems achieve little generalization. Since HMMs are successful at modeling and recognizing isolated spoken words and phrases, I would like to apply them to modeling and recognizing music objects. Timbre recognition could be tackled with HMMs too: current instrument recognition techniques base their decisions on only a few frames (Martin et al. 1998b and Brown 1999); they try to catch the timbre with a glance instead of a gaze, and HMMs are superior in this respect. Music object modeling is significant because it paves the way for music object extraction and manipulation, and thus for content/perception-based music resource management and access, and for model-based musical sound data reduction.

2. Statistical-model-based music transcription.

Music transcription would be another application of statistical music object modeling. So far I have not read any article on music transcription based on statistical models. Intuitively, if we statistically model the different pitch changes (e.g., octave up or down, semitone up or down) as music objects (which is feasible precisely because musical pitch change is quantized rather than continuous), we can track the pitch changes throughout a piece. Of course, other techniques such as temporal analysis could also be incorporated. I pursue this possibility in the following sections only to illustrate what one general speech technology could do for music and how far it could go; I do not mean that it is the only application speech technology could find in music, or the best thing speech technology could do for music.
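The transcription idea above, modeling quantized pitch changes as hidden states and decoding the most likely state sequence, rests on exactly the dynamic programming that HMMs embed. As a rough illustration (the two-state model and all its probabilities below are invented for the example, not trained on any data), here is a minimal Viterbi decoder for a discrete-observation HMM:

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely state path for an observation sequence.
    pi: initial state probs (N,), A: transitions (N, N), B: emissions (N, M)."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))            # best log-probability ending in each state
    psi = np.zeros((T, N), dtype=int)   # back-pointers to the best predecessor
    with np.errstate(divide="ignore"):  # log(0) -> -inf is fine here
        log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # rows: from-state, cols: to-state
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(delta[-1].argmax())]             # backtrack from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

# Toy 2-state model: state 0 tends to emit symbol 0, state 1 tends to emit symbol 1.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.3, 0.7]])
B = np.array([[0.9, 0.1], [0.1, 0.9]])
print(viterbi([0, 0, 1, 1, 1], pi, A, B))  # -> [0, 0, 1, 1, 1]
```

In a pitch-change model, the states would stand for interval classes (semitone up, octave down, etc.) and the observations for acoustic feature frames; this sketch only shows the decoding machinery.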
2. Mel-Frequency Cepstral Coefficients (MFCCs)

MFCCs demonstrate more robustness and accuracy, while requiring more computation, than another popular set of speech features, linear predictive cepstral coefficients (LPCCs). They are based on simple knowledge of the frequency-discriminating ability of the human cochlea. I do not think MFCCs would be any better than the sound features already employed by most music researchers; as stated before, computer music research has drawn far more on psychoacoustics.

(Here "musical object" refers to many kinds of musical events, such as sforzando, solo/tutti, crescendo/decrescendo, syncopation, pizzicato, etc. Such events are very important for ordinary people in identifying and describing music they have heard.)
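For readers unfamiliar with the MFCC front end discussed above, the computation is roughly: windowed power spectrum, triangular mel-scale filterbank, logarithm, then DCT. The sketch below is a bare-bones single-frame version with illustrative parameter choices (16 kHz sample rate, 26 filters, 13 coefficients); real front ends add pre-emphasis, overlapping frame streams, and liftering:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr=16000, n_filters=26, n_ceps=13):
    """MFCCs of a single audio frame (parameters are illustrative defaults)."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2
    # Triangular filters spaced evenly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, len(power)))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    log_energy = np.log(fbank @ power + 1e-10)
    # DCT-II decorrelates the log filterbank energies into cepstral coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return dct @ log_energy

frame = np.sin(2 * np.pi * 440 * np.arange(512) / 16000)  # one 440 Hz test frame
ceps = mfcc_frame(frame)
print(ceps.shape)  # (13,)
```

The same pipeline applies to musical sound unchanged, which is exactly the efficiency argument made below: the whole computation is a small FFT plus two matrix products.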

However, there are two issues I would like to stress. First, most sound features employed in music analysis are resource-demanding, in both computation and storage. To be implemented on PCs or inexpensive chips, which is important for their popularization, music analysis systems must find efficient features like MFCCs and LPCCs. Second, cepstral analysis could be tried more widely for music analysis. For statistical acoustic modeling, acoustic features are of great importance, especially when modeling musical sound, which is extremely rich in perceptual information. Perceptually informative and integrable sound features are required for successful statistical modeling, and features extractable on PCs and inexpensive DSPs or ASICs are very desirable for inexpensive music analysis systems.

3. N-Grams

The N-gram is a statistics-based language model. It is constructed by sorting and counting through the relevant training material, both spoken and written, and grammar rules can be explicitly incorporated into the model. N-grams have led to some successful large-vocabulary continuous speech recognition systems (for example, IBM's ViaVoice) and many domain-, application-, or task-specific practical systems.

I base this discussion on the assumption of successful music object modeling. Once we have modeled music objects, the way these models are organized certainly makes up some counterpart to the language grammar in speech recognition. Such a music event/object organizing language could be constructed in a similar way to the N-gram; the language could even itself be a music version of the N-gram. Combining statistical music object modeling with a music N-gram, we could build music analysis, parsing, and understanding systems.

HMMs, MFCCs, and N-grams have become the prototypes of, respectively, statistical acoustic models, speech features, and language models for speech recognition. They are very general techniques.
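The music N-gram imagined above can be sketched as a toy bigram model over symbolic pitch intervals. The interval corpus here is invented purely for illustration; a real system would train on a large symbolic music database:

```python
from collections import Counter, defaultdict

def train_bigram(sequences):
    """Maximum-likelihood bigram model: P(next symbol | current symbol)."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):   # count adjacent symbol pairs
            counts[a][b] += 1
    return {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for a, nxt in counts.items()}

# Pitch-interval "sentences" in semitones (+2 = whole step up, -4 = major third down)
corpus = [[2, 2, -4, 2, 2], [2, -4, 2, 2, -4]]
model = train_bigram(corpus)
print(model[2])  # distribution over intervals that follow a whole step up
```

In a complete system these conditional probabilities would weight the transitions between the statistically modeled music objects, exactly as a word bigram weights transitions between acoustic word models in speech recognition.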
After examining Table 1, we could also identify other, more specific speech techniques that might be applied to computer music.

3. Proposal

Following the preceding two sections, my proposal can be concise.

1. Goal

A graduate study oriented toward the Ph.D. should no doubt emphasize applicable research and scholarship rather than engineering practice. However, I believe that a clear idea of where one's research is heading contributes a great deal to the research itself. So I propose to conduct research on music analysis/synthesis for intelligent network-based music resource manipulation and intelligent multi-modal human-machine interfaces, that is:

1. Music analysis for content/perception-based musical sound data indexing, retrieval, manipulation, and reduction, including musical type, timbre, and object identification; musical object extraction and manipulation; resource-efficient music analysis systems; and perception-based acoustic features.
2. Music synthesis for interactive human-machine interfaces.

2. Method

There is a large pool of candidate methods that could be adopted, from both speech and music research. In addition to the usually adopted methods, I propose to try the following speech techniques in music analysis research:

1. Statistical acoustic modeling of musical sound, especially HMMs
2. Statistical language modeling of music
3. Cepstral analysis
4. Advanced pattern recognition methodologies for music analysis

We conduct research not just to be original or to try new methods. Instead, we originate because we find the current

tools are limited; we try new methods because we expect them to yield better results. It does not matter whether a cat is white or black; it is a good cat as long as it catches mice (a Chinese saying). The same holds in computer music research.

3. Condition: A Personal View

Four conditions have contributed most to speech research: 1) continual support from governments and industry; 2) widely recognized and adopted standard speech databases; 3) close collaboration among researchers in signal processing, statistics, linguistics, and natural language understanding; and 4) prompt commercialization of feasible speech technologies. Similarly, to be successful in the long run, computer music research should:

1. Aim research at benefiting ordinary people in addition to wealthy music experts and studio engineers (I firmly believe this is a common trend for signal processing and network research).
2. Build and recognize the emerging musical sound databases (this needs time and money).
3. Set up multidisciplinary groups that accommodate direct communication and close collaboration (current research groups are usually strong in some aspects but weak in others).
4. Commercialize and advertise any "usable", if not perfect, music technology (doing so will not only get research better funded but also pave the way for the future spread of computer music systems).

4. References

Brown, J. (1999), "Computer Identification of Musical Instruments Using Pattern Recognition with Cepstral Coefficients as Features," J. Acoust. Soc. Am., vol. 105, no. 3, March.
Cole, R. et al. (1995), "The Challenge of Spoken Language Systems: Research Directions for the Nineties," IEEE Trans. Speech and Audio Processing, vol. 3, no. 1, pp. 1-21.
Ellis, D. P. W. (1992), A Perceptual Representation of Audio, Master's thesis, EECS Dept., MIT.
Juang, B.-H., ed. (1998), "The Past, Present, and Future of Speech Processing," IEEE Signal Processing Magazine, May, pp. 24-48.
Martin, K.
et al. (1998a), "Music Content Analysis through Models of Audition," presented at the 1998 ACM Multimedia Workshop on Content Processing of Music for Multimedia Applications, Bristol, England.
Martin, K. et al. (1998b), "2pMU9: Musical Instrument Identification: A Pattern-Recognition Approach," presented at the 136th meeting of the Acoustical Society of America, Oct. 13.
McLeod, D. et al. (1999), "The Move Toward Media Immersion," IEEE Signal Processing Magazine, Jan., pp. 33-43.
Rabiner, L. and Juang, B.-H. (1993), Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, New Jersey.
Raphael, C. (1999), "Automatic Segmentation of Acoustic Musical Signals Using Hidden Markov Models," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 21, no. 4, April.
Rose, R. C. (1995), "Keyword Detection in Conversational Speech Utterances Using Hidden Markov Model Based Continuous Speech Recognition," Computer Speech & Language, vol. 9, pp. 309-333.
Scheirer, E. (1998), Music Perception Systems, unpublished proposal for Ph.D. dissertation, MIT Media Laboratory.
Sharma, R. et al. (1998), "Toward Multimodal Human-Computer Interface," Proc. IEEE, vol. 86, no. 5, pp. 853-869.
Vercoe, B. L. et al. (1998), "Structured Audio: Creation, Transmission, and Rendering of Parametric Sound Representations," Proc. IEEE, vol. 86, no. 5, pp. 922-940.