Proposal for Application of Speech Techniques to Music Analysis

Lin Zhong
Dept. of Electronic Engineering, Tsinghua University

(This proposal is preliminary and incomplete, and is to undergo revision.)

1. Research on Speech and Music

1. Goal

From the very beginning, speech research has aimed at benefiting the general populace rather than experts. Apart from the military prospects, which have brought it a great deal of money, speech research has had two main goals: 1) low bit-rate transmission of speech and 2) speech-based human-machine interaction. Speech coding/synthesis was developed for the former and speech recognition technology for the latter. With the development of microelectronics and network technology, these two technologies look more and more desirable and lucrative. That is another reason why speech research attracts money from governments and industry.

By contrast, computer music began as something like a personal hobby and then aimed at benefiting music people, especially composers and music students. Computer music research has received limited funding because it has benefited a very limited community.

As signal processing people and funds gather around the Internet, music people have noticed the tremendous need of common people to access, retrieve and index audio resources on the network (Vercoe et al. 1998). Speech people have long noticed the same need (Rose 1995). Moreover, with the development of Multi-modal Human-Machine Interfaces (Sharma et al. 1998) and Integrated Media Systems (McLeod et al. 1999), music, like speech and gesture, promises to become one mode through which people communicate with computers, just as we communicate with each other through music. (This, I think, is where musical sound synthesis would be most attractive.)

Faced with the Internet and the computer, research on speech and research on music can find common goals, and I believe this is the right place for computer music research to obtain funding. Many speech and music research topics have physically similar goals (see Table 1).

2. Method

Many methods for speech synthesis/coding are quite different from those for music synthesis because the goals are rather different. For transmission-oriented speech coding/synthesis, the foremost concern is the best quality at the cost of the least computation and distortion, whereas for music synthesis it is usually authenticity. For human-machine interfaces based on natural modes, however, machines have to speak to their users. Here authenticity also becomes the foremost concern for speech synthesis, or, more precisely, for text-to-speech synthesis, and many of the techniques employed are just "transformed" versions of those used for music synthesis.

Three strategies have been employed to tackle speech recognition (Rabiner & Juang 1993): Acoustics-Phonetics, Pattern Recognition and Artificial Intelligence. So far the most successful one is the statistics-based pattern recognition strategy (although speech people are seeking ways to do away with the "devil" of statistics and to welcome explicit modeling). Where Spoken Language Understanding is concerned, AI is also incorporated. In addition, cepstral analysis is as widely employed for speech as temporal or spectral analysis.

The most widely adopted strategy for musical sound analysis seems to combine Artificial Intelligence, Psycho-acoustics, Music Theory and Signal Processing. Such a strategy is more deterministic than statistical. Methods based on statistics or pattern classification have not yet been fully developed (Martin et al. 1998b, Brown 1999 and Raphael 1999). Temporal and spectral analyses are employed much more than cepstral analysis.

However, music people should take pride in their insightful explorations of psycho-acoustics and the human auditory system (Ellis 1992, Martin et al. 1998a and Scheirer 1998). Such knowledge has so far not been much exploited for speech recognition. Music people should also be proud of the diversity of the techniques they have been studying. While many speech analysis techniques have become "standard" in the speech community, there are no unanimously adopted or agreed techniques for music analysis. What is more, because statistical acoustical and language modeling has long been the core of speech analysis, the biggest share of speech recognition research has gone into collecting speech databases and training statistical models. In this regard, music people have done much more on the signal processing of musical sound.

Table 1. Speech Research Topics and Their Music Counterparts

Speech Topic | Speech Note | Music Topic | Music Note
---------------------------------------------------------------------------
Speech Synthesis | i) speech synthesis from coded speech for transmission; ii) text-to-speech synthesis; iii) synthesis for a specific speaker | Musical Sound Synthesis Techniques | i) score-to-music synthesis; ii) synthesis for a specific instrument; see Roads' book
Speaker Recognition | GMM/Cepstrum | Instrument (Timbre) Recognition | GMM/Correlogram, Cepstrum
Language Recognition | - | Music Genre/Type Recognition | -
Speech Recognition | Explicit statistical and acoustical modeling | Music Transcription | No explicit statistical and acoustical modeling
Pitch Tracking | Nearly the same as for music | Pitch Tracking | Nearly the same as for speech
Mandarin Speech Tone Recognition | Pitch tracking and temporal analysis, neural networks | Music Tone Recognition | -
Cocktail Party Effect / Speaker Tracking and Separation | 1) Mic array technique; 2) voice separation and tracking | - | -

3. State of the Art

The success of speech coding/synthesis is evident from speech-based telecommunication (Juang, ed. 1998). Speech coding and compression research is now heading toward the theoretical limit and toward greater robustness.
In addition, speech people now talk about "Spoken Language Systems" (Cole et al. 1995) instead of "Automatic Speech Recognition". The latter has become only a sub-area of the former. This change in terminology reflects how much progress speech recognition research has made in the past years. Various commercial automatic speech recognizers are available on the PC and on a single chip (for example, see the homepage of Sensory, Inc.).

Music synthesis for economical representation and transmission is equally successful in many respects. The standardized MIDI format is an excellent example. However, owing to the inherent complexity of music and of musical instrument acoustics, computer-synthesized sound for instruments such as the violin is far from satisfactory in terms of quality and required computation. Furthermore, unlike speech synthesis, which in most cases only needs to be understandable, musical instrument sound synthesis must be enjoyable. This is the big challenge for music synthesis.

As for music analysis, "Today's computer systems are not capable of understanding music at the level of an average five-year-old; they cannot recognize a melody in a polyphonic recording or understand a song on a children's television program." (Martin et al. 1998a). I believe there are many more interesting things to do with musical sound.

2. Speech Techniques and Their Prospects in Music

As my research experience so far centers on speech recognition, I will stay within this domain here. There are three technologies of which speech recognition people generally boast. I would like to discuss them among many

other techniques.

1. Hidden Markov Models (HMMs)

HMMs are the most successful technique for statistically and dynamically modeling the speech signal. HMM training is straightforward to implement and mathematically sound, and decent generalization from training samples is usually observed. The embedded Viterbi dynamic programming is able to warp and align the time variation of speech. Moreover, an HMM, being both a finite automaton and a probabilistic model, can easily incorporate rules (lexicon, syntax, semantics, etc.) and other probabilistic models (neural networks, linear regression, etc.). A system based on HMMs is ready to be both knowledge-driven and data-driven. These features are desirable for musical sound too.

However, HMMs have many limits. The two most serious are:

1. To ensure both robustness and accuracy, there are never enough speech data to train the system. If HMMs are applied to modeling musical sound, this problem will be more serious because musical sound databases are only just emerging. Speech people, while cherishing statistics, are seeking to avoid it. Research has been conducted to model many phenomena explicitly instead of statistically, including speaker variation, background noise and channel distortion. Doing so greatly reduces the amount of training data required.

2. Distribution estimation based on HMMs is parametric and non-discriminative. The popular maximum likelihood estimation (MLE) cannot ensure minimum classification error (MCE). One recent research focus is on training techniques that secure the greatest accuracy and robustness. Training techniques such as Maximum a Posteriori (MAP) training, Maximum Mutual Information (MMI) training and Generalized Probabilistic Descent/Minimum Classification Error (GPD/MCE) training have been advertised.

I think that HMMs are the most promising technique that could readily be applied to music analysis. Some of the many possibilities are:

1. Statistical Music Object Modeling and Recognition.

Many music objects can be identified by people only over many notes or measures; they are identified as a dynamic time series. (Here "music object" refers to many kinds of musical events, such as sforzando, solo/tutti, crescendo/decrescendo, syncopation, pizzicato, etc. Such events are very important for common people to identify and describe music they have heard.) The methods employed so far are poor at dealing with time dynamics and at integrating long segments. Moreover, knowledge-based identification systems achieve little generalization. As HMMs are successful in modeling and recognizing isolated spoken words and phrases, I would like to apply them to modeling and recognizing music objects.

Timbre recognition could be tackled by HMMs too. Current instrument recognition techniques base their decisions on only a few frames (Martin et al. 1998b and Brown 1999). They try to catch the timbre at a glance instead of gazing at it. HMMs are superior in this respect.

Music object recognition is significant because it paves the way for Music Object Extraction and Manipulation, and thus for Content/Perception-based Music Resource Management and Access and for model-based Musical Sound Data Reduction.

2. Statistical Model based Music Transcription.

Music transcription would be another application of statistical music object modeling. Until now, I have not read any article about music transcription based on statistical models. Intuitively, if we statistically model the different changes of pitch (e.g., octave up or down, semitone up or down) as music objects (this is feasible precisely because musical pitch change is quantized rather than continuous), we can track the pitch changes all the way through. Of course, other techniques such as temporal analysis could also be incorporated.

I will pursue this possibility in the following parts just to illustrate what one general speech technology could possibly do for music and how far it could go. I do not mean that it is the only application speech technology could find in music, or that it is the best thing speech technology could do for music.

2. Mel Frequency Cepstrum Coefficients (MFCCs)

MFCCs demonstrate more robustness and accuracy, while requiring more computation, than another popular set of speech features, Linear Predictive Cepstrum Coefficients (LPCCs). They are based on simple knowledge of the frequency-resolving ability of the human cochlea. I do not think MFCCs would be any better than the sound features already employed by most music people. As stated before, research on computer music has employed much more psycho-acoustics.
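To make the MFCC idea concrete, here is a minimal sketch of how one frame of sound could be turned into MFCCs: window, power spectrum, triangular filters spaced evenly on the mel scale, logarithm, then a cosine transform. It is not taken from any particular recognizer. The mel formula, the Hamming window, the 20 filters, the 13 coefficients and the function names are all assumptions chosen for illustration, and the DFT is a naive pure-Python one that is only practical for short frames.

```python
import cmath
import math


def hz_to_mel(f_hz):
    """Map Hz to mel (one commonly used formula; an assumption here)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Hamming-window the frame and return power at bins 0..n/2 (naive DFT)."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def mel_filterbank_energies(frame, sample_rate, n_filters=20):
    """Sum spectral power under triangular filters equally spaced in mel."""
    n = len(frame)
    spectrum = power_spectrum(frame)
    top_mel = hz_to_mel(sample_rate / 2.0)
    # n_filters + 2 edge frequencies: each filter spans three consecutive edges.
    edges = [mel_to_hz(top_mel * i / (n_filters + 1))
             for i in range(n_filters + 2)]
    energies = []
    for m in range(n_filters):
        lo, centre, hi = edges[m], edges[m + 1], edges[m + 2]
        total = 0.0
        for k, power in enumerate(spectrum):
            f = k * sample_rate / n  # centre frequency of DFT bin k
            weight = min((f - lo) / (centre - lo), (hi - f) / (hi - centre))
            if weight > 0.0:
                total += weight * power
        energies.append(total)
    return energies


def mfcc(frame, sample_rate, n_filters=20, n_ceps=13):
    """Log filterbank energies followed by a DCT-II give the cepstral vector."""
    energies = mel_filterbank_energies(frame, sample_rate, n_filters)
    log_e = [math.log(e + 1e-10) for e in energies]  # small floor avoids log(0)
    return [sum(le * math.cos(math.pi * c * (m + 0.5) / n_filters)
                for m, le in enumerate(log_e))
            for c in range(n_ceps)]


if __name__ == "__main__":
    sr = 8000  # hypothetical sampling rate
    tone = [math.sin(2.0 * math.pi * 440.0 * i / sr) for i in range(256)]
    print([round(c, 2) for c in mfcc(tone, sr)])
```

The whole feature vector for a frame is a handful of numbers, which is the resource argument for such features. A practical system would replace the naive DFT with an FFT.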

However, there are two issues I would like to stress. First, most sound features employed in music analysis are demanding in resources (both computation and storage). To be implemented on PCs or inexpensive chips, which is important if they are to become popular, music analysis systems must find efficient features like MFCCs and LPCCs. Second, cepstral analysis could be tried more in music analysis.

For statistical acoustical modeling, acoustical features are of great importance, especially when modeling musical sound, which is extremely rich in perceptual information. Perceptually informative and integrable sound features are required for successful statistical modeling. Features that can be extracted on PCs and inexpensive DSPs or ASICs are very desirable for inexpensive music analysis systems.

3. N-Gram

The N-gram is a statistics-based language model. It is constructed by sorting and counting through the relevant training language materials, both spoken and written. Grammar rules can be explicitly incorporated into the language model. The N-gram has led to some successful Large Vocabulary Continuous Speech Recognition systems (for example, ViaVoice of IBM) and many practical domain-, application- or task-related systems.

My discussion here assumes successful music object modeling. If we have modeled music objects, the way these models are organized certainly constitutes some kind of counterpart to the language grammar in speech recognition. Such a language for organizing music events/objects could be constructed in a way similar to the N-gram; the language could even itself be the music version of the N-gram. Combining statistical music object modeling with the music N-gram, we could establish systems for music analysis, parsing and understanding.

HMMs, MFCCs and N-grams have become the prototypes for statistical acoustical models, speech features and language models, respectively, in speech recognition. They are very general techniques.
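The counting construction behind an N-gram, applied to sequences of music objects rather than words, can be sketched as follows for N = 2 (a bigram). This is only an illustration under stated assumptions: the object labels, the start/end markers and the function names are hypothetical, the tiny corpus is made up, and no smoothing is applied, so unseen transitions get probability zero.

```python
import math
from collections import Counter, defaultdict

START, END = "<s>", "</s>"  # hypothetical sequence boundary markers


def train_bigram(sequences):
    """Count how often each music object follows each other one."""
    counts = defaultdict(Counter)
    for seq in sequences:
        padded = [START] + list(seq) + [END]
        for prev, cur in zip(padded, padded[1:]):
            counts[prev][cur] += 1
    return counts


def bigram_prob(counts, prev, cur):
    """Relative-frequency estimate of P(cur | prev); 0.0 if prev was never seen."""
    following = counts.get(prev)
    if not following:
        return 0.0
    return following[cur] / sum(following.values())


def sequence_log_prob(counts, seq):
    """Log-probability of a whole object sequence under the bigram model."""
    padded = [START] + list(seq) + [END]
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        p = bigram_prob(counts, prev, cur)
        if p == 0.0:
            return float("-inf")  # unseen transition, no smoothing in this sketch
        total += math.log(p)
    return total


if __name__ == "__main__":
    # Made-up training "scores", each a sequence of recognized music objects.
    corpus = [
        ["solo", "crescendo", "tutti"],
        ["solo", "pizzicato"],
        ["solo", "crescendo", "sforzando"],
    ]
    model = train_bigram(corpus)
    print(bigram_prob(model, "solo", "crescendo"))
    print(sequence_log_prob(model, ["solo", "crescendo", "tutti"]))
```

In a combined system, scores of this kind would play the role the language model plays in speech recognition, weighting the object hypotheses proposed by the acoustical models.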
After examining Table 1, we could also identify other specific speech techniques that could possibly be applied to computer music.

3. Proposal

Following the preceding two sections, my proposal can be concise.

1. Goal

A graduate study oriented toward the Ph.D. should no doubt emphasize applicable research and scholarship rather than engineering practice. However, I believe that a clear idea of where one's research is heading contributes a great deal to that research. So I propose to conduct research on music analysis/synthesis for Intelligent Network-based Music Resources Manipulation and Intelligent Multi-modal Human-Machine Interfaces, that is,

1. Music Analysis for Content/Perception-based Musical Sound Data Indexing, Retrieval, Manipulation and Reduction, including
   Musical Type, Timbre and Object Identification;
   Musical Object Extraction and Manipulation;
   Resource Efficient Music Analysis Systems;
   Perception Based Acoustical Features.
2. Music Synthesis for Interactive Human-Machine Interfaces.

2. Method

There is a big pool of candidate methods that could be adopted, from both speech and music research. In addition to the usually adopted methods, I propose to try the following speech techniques for music analysis research:

1. Statistical Acoustical Modeling for Musical Sound, especially HMMs
2. Statistical Language Modeling for Music
3. Cepstral Analysis
4. Advanced Pattern Recognition Methodologies for Music Analysis

We conduct research not just to be original or to try new methods. Instead, we originate because we find the current

tools are limited; we try new methods because we expect they will yield better results. "Never mind whether the cat is white or black; it is a good cat only when it catches mice" (a Chinese saying). The same holds for computer music research.

3. Condition: A Personal View

Four conditions have contributed most to speech research: 1) continual support from governments and industry, 2) widely recognized and adopted standard speech databases, 3) close collaboration among researchers in signal processing, statistics, linguistics and natural language understanding, and 4) prompt commercialization of feasible speech technologies. Similarly, to be successful in the long run, computer music research should

1. Aim research at benefiting common people in addition to wealthy music experts or studio engineers (I firmly believe this is a common trend in signal processing and network research).
2. Try and recognize the emerging musical sound databases (this needs time and money).
3. Set up multidisciplinary groups that can accommodate direct communication and close collaboration (current research groups are usually strong in some aspects while weak in others).
4. Commercialize and advertise any "usable", if not perfect, music technology (doing so will not only get research better funded but also pave the way for the future spread of computer music systems).

4. Reference

Brown, J. (1999), "Computer Identification of Musical Instruments Using Pattern Recognition with Cepstral Coefficients as Features," J. Acoust. Soc. Am., vol. 105, no. 3, March.
Cole, R. et al. (1995), "The Challenge of Spoken Language Systems: Research Directions for the Nineties," IEEE Trans. Speech and Audio Processing, vol. 3, no. 1, pp. 1-21.
Ellis, D.P.W. (1992), A Perceptual Representation of Audio, Master's thesis, EECS Dept., MIT.
Juang, B.-H., ed. (1998), "The Past, Present, and Future of Speech Processing," IEEE Signal Processing Magazine, May, pp. 24-48.
Martin, K. et al. (1998a), "Music Content Analysis through Models of Audition," presented at the 1998 ACM Multimedia Workshop on Content Processing of Music for Multimedia Applications, Bristol, England.
Martin, K. et al. (1998b), "2pMU9: Musical Instrument Identification: A Pattern-Recognition Approach," presented at the 136th meeting of the Acoust. Soc. Am., Oct. 13.
McLeod, D. et al. (1999), "The Move Toward Media Immersion," IEEE Signal Processing Magazine, Jan., pp. 33-43.
Rabiner, L. and Juang, B.-H. (1993), Fundamentals of Speech Recognition, Prentice-Hall, Englewood Cliffs, New Jersey.
Raphael, C. (1999), "Automatic Segmentation of Acoustic Musical Signals Using Hidden Markov Models," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 21, no. 4, April.
Rose, R.C. (1995), "Keyword Detection in Conversational Speech Utterances Using HMM Based Continuous Speech Recognition," Computer Speech & Language, vol. 9, pp. 309-333.
Scheirer, E. (1998), Music Perception Systems, unpublished proposal for Ph.D. dissertation, MIT Media Laboratory.
Sharma, R. et al. (1998), "Toward Multimodal Human-Computer Interface," Proc. IEEE, vol. 86, no. 5, pp. 853-869.
Vercoe, B.L. et al. (1998), "Structured Audio: Creation, Transmission, and Rendering of Parametric Sound Representations," Proc. IEEE, vol. 86, no. 5, pp. 922-940.