
MIRAI: MUSIC INFORMATION RETRIEVAL BASED ON AUTOMATIC INDEXING
Rory A. Lewis
PhD Thesis Qualification Paper
For: Dr. Mirsad Hadzikadic, Ph.D., Dr. Tiffany M. Barnes, Ph.D., Dr. Zbigniew W. Ras, Sc.D., Ph.D.
Department of Computer Science, University of North Carolina at Charlotte, 9201 University City Blvd., Charlotte, NC 28223

Abstract: The increasing growth and popularity of multimedia resources available on the Web has brought the need for new, more advanced tools for research. However, searching through multimedia data is a highly non-trivial task that requires content-based indexing of the data. My research will focus on automatic extraction of information about sound timbre, and on indexing sound data with information about the musical instrument playing in a given segment. Sound timbre is a very important factor that can affect the perceptual grouping of music. The real use of timbre-based grouping of music is very nicely discussed in (Bregman, 1990). The aim is to perform automatic classification of musical instrument sounds from real recordings for a broad range of sounds, independently of the fundamental frequency of the sound. My thesis will focus on musical instruments of definite pitch, used in contemporary orchestras and bands. The full range of the musical scale for each instrument will be investigated. The investigation will start with the descriptors defined in MPEG-7. Although MPEG-7 provides some tools for indexing with musical instrument names, this information is inserted rather manually (for instance, tracks are labeled with voices/instruments in recording studios). There are no algorithms included in MPEG-7 to automate this task. In order to index the enormous number of audio files of various origins which are available to users on the Web, special processing and new algorithms are needed to extract this kind of knowledge directly from audio signals. The Music Information Retrieval Based on Automatic Indexing system will be called MIRAI and will be based on low-level descriptors that can be easily extracted automatically for any audio signal. Apart from observing the descriptor set for a given frame, it will also trace descriptor changes in time. Finally, if MPEG-7 becomes commonly used as a standard, the results of this research will provide interoperability for various applications in the music domain. Automatic sound indexing should allow labeling sound segments with instrument names. The MIRAI implementation will start with singular, homophonic sounds of musical instruments, and then extend the investigation to simultaneous, polyphonic sounds. Knowledge discovery techniques will be applied at this stage of research. First of all, we have to discover rules that recognize various musical instruments. Next, we apply these rules, one by one, to unknown sounds. By identifying so-called supporting rules, we should be able to point out which instrument is playing (or is dominating) in a given segment, and in what time instants this instrument starts and ends playing. Additionally, MIRAI will extract pitch information, which is one of the important factors in sound classification. By combining melody and timbre information, MIRAI should be able to search successfully for favorite tunes played by favorite instruments.

The Significance of the Thesis: The MIRAI thesis will advance research on automatic content extraction from audio data, with application to full-band musical sounds, as opposed to the quite broad research on speech signals, which are usually limited in frequency range. Investigating automatic indexing of instrumental recordings will also allow formalizing the description of sound timbre for musical instruments. There are a number of different approaches to sound timbre (for instance (Balzano, 1986) or (Cadoz, 1985)).
A dimensional approach to timbre description was proposed by (Bregman, 1990). Timbre description is basically subjective and vague, and only some subjective features have well-defined objective counterparts, like brightness, calculated as the gravity center of the spectrum. The explicit formulation of rules for objective

specification of timbre in terms of digital descriptors will formally express subjective and informal sound characteristics. It is especially important in the light of human perception of sound timbre. Time-variant information is necessary for correct classification of musical instrument sounds, because the quasi-steady state itself is not sufficient for human experts. Therefore, the evolution of sound features in time should be reflected in the sound description as well. The discovered temporal patterns may express sound features better than static features, especially since classic features can be very similar for sounds representing the same family or pitch, whereas the changeability of features with pitch makes the sounds of one instrument dissimilar to each other. Therefore, classical sound features can make correct identification of a musical instrument independently of the pitch very difficult and erroneous. This research represents the first application of discovering temporal patterns in the time evolution of MPEG-7 based, low-level sound descriptors of musical instrument sounds, with application to simultaneous sounds. KDD methods applied to the extraction of temporal patterns and the search for the best classifier (quite successful in other domains, e.g. business, medicine) will aid the signal analysis methods used as a preprocessing tool and will contribute to the development of knowledge on musical timbre. I will also perform research on the construction of new attributes in order to find the best representation for sound recognition purposes. In recent years, there has been a tremendous need for the ability to query and process vast quantities of musical data, which are not easy to describe with mere symbols. Automatic content extraction is clearly needed here, and it relates to the ability to identify the segments of audio in which particular instruments are playing. It also relates to the ability to identify musical pieces representing different types of emotions, which music clearly evokes, or to generate human-like expressive performances (Mantaras and Arcos, 2002). Automatic content extraction may relate to many different types of semantic information related to musical pieces. Some information can be stored as metadata provided by experts, but some has to be computed in an automatic way. I believe that my approach based on KDD techniques will advance research on automatic content extraction, not only in identifying the segments of audio in which particular instruments are playing, but also in identifying the segments of audio containing other, more complex semantic information.

Background for MIRAI: In recent years, automatic indexing of multimedia data has become an area of considerable research interest because of the need for quick searching of digital multimedia files. Broad access to the Internet, available to millions of users, creates a significant market for products dealing with content-based searching through multimedia files. The domain of image processing and content extraction is extensively explored all over the world and there are numerous publications available on that topic. Automatic extraction of audio content is not that much explored, especially for musical sounds.

Methods in the Research on Musical Instrument Sound Classification: Broader research on automatic musical instrument sound classification goes back only the last few years. So far, there is no standard parameterization used as a classification basis.
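As a concrete illustration of the kind of low-level descriptor involved, brightness, mentioned above as the gravity center of the spectrum, can be computed directly from a short frame of samples. The sketch below is a minimal example only; the frame length, sample rate and Hanning window are illustrative assumptions, not choices made by this proposal.

```python
import numpy as np

def spectral_centroid(frame: np.ndarray, sample_rate: float) -> float:
    """Brightness proxy: gravity center of the magnitude spectrum, in Hz."""
    windowed = frame * np.hanning(len(frame))      # reduce spectral leakage
    magnitude = np.abs(np.fft.rfft(windowed))      # one-sided magnitude spectrum
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    if magnitude.sum() == 0.0:                     # silent frame: centroid undefined
        return 0.0
    return float((freqs * magnitude).sum() / magnitude.sum())

# Toy usage: a 440 Hz tone in a 25 ms frame at 44.1 kHz.
sr = 44100
t = np.arange(int(0.025 * sr)) / sr
print(spectral_centroid(np.sin(2 * np.pi * 440 * t), sr))  # close to 440 Hz
```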
The sound descriptors used are based on various methods of analysis in the time and spectrum domains, with the Fourier Transform being the most common tool for spectral analysis. Also, wavelet analysis gains increasing interest for sound, and especially musical sound, analysis and representation, see for instance (Popovic, Coifman and Berger, 1995), (Goodwin, 1997). Diversity of sound timbres is also used to facilitate data visualization via sonification, in order to make complex data easier to perceive (Ben-Tal, Berger, B. Cook, Daniels, Scavone and P. Cook, 2002). Many parameterization and recognition methods, including pitch extraction techniques, applied in musical research come from the speech and speaker recognition domain (Flanagan, 1972), (Rabiner and Schafer, 1978). Sound parameters applied in research performed so far on musical instrument classification include cepstral coefficients, constant-Q coefficients, spectral centroid, autocorrelation coefficients, and moments of the time wave (Brown, Houix and McAdams, 2001), wavelet analysis (Wieczorkowska, 2001), (Kostek and Czyzewski, 2001), root mean square (RMS) amplitude envelope and multidimensional scaling analysis trajectories (Kaminskyj, 2000), and various

spectral and temporal features (Kostek and Wieczorkowska, 1997), (Martin and Kim, 1998), (Wieczorkowska, 1999). The sound sets used differ from experiment to experiment, with the McGill University Master Samples (MUMS) CDs being the most common (Opolko and Wapnick, 1987), yet not always used (Brown, Houix and McAdams, 2001), making comparison of results more difficult. Some experiments operate on a very limited set of data, like 4 instruments, or singular samples for each instrument. Even if the investigations are performed on MUMS data, every researcher selects a different group of instruments and number of classes, and the testing method is also different. Therefore, the data sets used in experiments and the obtained results are not comparable. Additionally, each study follows different parameterization technique(s), which makes comparison yet more difficult. I would like to apply low-level MPEG-7 audio descriptors as a starting point in the search for the best representation for musical instrument classification purposes. Since MPEG-7 is generally a standard for audio-video applications, its audio description is not aimed at specific instrument or articulation indexing. Rather, descriptors in this standard were chosen on the basis of ease of extraction. This is why we would like to use these descriptors for further investigations. The classifiers applied in investigations on musical instrument sound classification represent practically all known methods. The most popular classifier is k-Nearest Neighbor (k-NN), see for example (Kaminskyj, 2000). This classifier is relatively easy to implement and quite successful. Other reported results include Bayes decision rules, Gaussian mixture models (Brown, Houix and McAdams, 2001), artificial neural networks (Kostek and Czyzewski, 2001), decision trees and rough set based algorithms (Wieczorkowska, 1999), discriminant analysis (Martin and Kim, 1998), hidden Markov Models (HMM), support vector machines (SVM) and others. The obtained results vary depending on the size of the data set, with accuracy reaching even 100% for 4 classes. However, the results for more than 10 instruments, explored in the full musical scale range, are generally below 80%. The International Conferences on Music Information Retrieval classify music information retrieval methodology into three broad categories: clever searches, "sounds-like" queries and perceptual queries, and thereafter categorize the type of recognition into six categories: timbre recognition, singer recognition, melody recognition, rhythm recognition, genre recognition and mood recognition. An example of a clever search is the Philips Audio Fingerprinting system, known as "Name That Tune": it finds a song one does not know by name but by its tune; however, it has a poor quality musical extractor and does not work with short segments of music (Jaap Haitsma and Ton Kalker, "A Highly Robust Audio Fingerprinting System", Proceedings of ISMIR 2002, Paris, France, October 2002). An example of a "sounds-like" query system is the Fischer traditional text and numeric queries approach. It searches and sorts sounds by similarity, based on pitch, loudness, brightness and/or overall timbre. However, it does not address sound at the level of the musical phrase, melody, rhythm or tempo (Keislar, D., Blum, T., Wheaton, J., & Wold, E., "A content-aware sound browser", Proc. of the International Computer Music Conference, ICMA, 1999). An example of a search-by-perceptual-similarity system is Studio Online, which is a content-based search and classification interface.
It has a primitive search-by-perceptual-similarity function over thousands of sounds for professionals to use. The problem is that it is not intelligent: it cannot interact or recognize sounds by itself. An extensive

review of parameterization and classification methods applied in research on this topic, with the obtained results, is given in (Herrera, Amatriain, Batlle and Serra, 2000). The classifiers we would like to investigate include k-NN, the HMM chosen by MPEG-7, and the recently developed SVM. I also consider the use of neural networks, especially time-delayed neural networks (TDNN), since they perform well in speech recognition applications (Meier, Stiefelhagen, Yang and Waibel, 2000). The performance of the classifiers will be compared on the same testing data sets, which we plan to elaborate for that purpose.

Sound Data
Generally, identification of musical information can be performed for audio samples taken from real recordings, representing the waveform, and for MIDI (Musical Instrument Digital Interface) data. MIDI files give access to highly structured data. They provide information about the pitch (fundamental frequency), the effects applied, the beginning and end of each note, the voices (timbres) used, and every note that is present in a given time instant. So, research on MIDI data may basically concentrate on higher levels of musical structure, like key or metrical information. I plan to deal with recordings where for each channel there is only access to one-dimensional data, i.e. to single samples representing the amplitude of the sound. Any basic information like pitch (or pitches, if there are more sounds), timbre, and the beginning and end of the sound must be extracted via digital signal processing. There are many methods of pitch extraction, mostly coming from speech processing. But even extraction of such simple information may produce errors and poses some difficulties. Especially octave errors are common for a singular sound. Various errors can be produced for border frames, where consecutive sounds of different pitch are analyzed. Pitch extraction for layered sounds is even more difficult, especially when spectra overlap. Basically, the parameters of fundamental frequency trackers are usually adjusted to the characteristics of the instrument that is to be tracked, but this cannot be done when we do not know what instrument is playing. Identification of musical timbre is even more difficult. Timbre is a rather subjective quality, defined by ANSI as "the attribute of auditory sensation, in terms of which a listener can judge that two sounds, similarly presented and having the same loudness and pitch, are different". Such a definition is subjective and not of much use for automatic sound timbre classification. Therefore, musical sounds must be very carefully parameterized to allow automatic timbre recognition. I assume that the time domain, the spectrum, and the evolution of sound features must be taken into account. (Example sound files: violin, viola, cello, double bass, violin ensemble; violin bowed, bowed vibrato, pizzicato, muted vibrato, natural harmonics, artificial harmonics, martelé.)

Basic Parameterization of Musical Instrument Sounds and their Classification
As mentioned earlier, there exist numerous parameterization methods that have been applied to musical instrument sounds so far. In my research, I decided to base the parameterization on the

MPEG-7 standard. This standard provides a multimedia content description interface, and if it gains popularity, the use of an MPEG-7 based representation should increase the usability of my work. MPEG-7 provides a universal mechanism for exchanging descriptors of multimedia data. MPEG-7 shall support at least the description of the following types of auditory data: digital audio, analogue audio, MIDI files, model-based audio, and production data (Manjunath, Salembier and Sikora, 2002). Subclasses of auditory data covered by this standard include: sound track (natural audio scene), music, speech, atomic sound effects, symbolic audio representation, and mixing information. In MPEG-7, so-called Multimedia Description Schemes provide the mechanisms by which we can create ontologies and dictionaries, in order to describe musical genre as a hierarchical taxonomy or identify a musical instrument from a list of controlled terms. The evolution of spectral sound features in time can be observed in MPEG-7 by means of HMM. Therefore, indexing a sound in this standard consists of selecting the best-fit HMM in a classifier and generating the optimal state sequence (path) for that model. The path describes the evolution of a sound through time using a sequence of integer state indices as representation. Classifiers used so far in research on musical instrument sound classification include a wide variety of methods, and the use of HMM is not obligatory in any way. I am going to use the standard as a starting point only, taking sound descriptors as a basis for further processing and research. The problem here is that the Timbre Description Tools provided within MPEG-7, based on simple descriptors like attack or brightness of sound, are aimed at describing 2 out of 4 classes of all musical sounds, i.e. harmonic, coherent, sustained sounds, and non-sustained, percussive sounds. I would not like to limit myself to only these musical timbre descriptors predefined in MPEG-7. The low-level descriptors which we plan to use in this project are defined for easy automatic calculation. They may serve as a basis for the extraction of new parameters (for instance, the AudioSpectrumBasis descriptor), better suited to the instrument classification purpose. High-level descriptors from this standard cannot be extracted automatically, but based on low-level descriptors we can calculate new ones, including linear or logical combinations of lower-level parameters. Therefore, we decided to choose low-level MPEG-7 descriptors as a research basis, and then search for the classifier. The Co-PI has already started such experiments with searching for new attributes based on simpler ones, see (Slezak, Synak, Wieczorkowska, Wroblewski, 2002).

TV-trees Used for Content Description Representation of Audio Data
(Ras & Wieczorkowska, 2001, 2003) used trees similar to telescopic vector trees (TV-trees) to represent content description of audio data in a multimedia database. I briefly explain the notion we refer to as a TV-tree by showing how it can be constructed for audio data. First, each audio signal is divided into N window segments, where each window segment is seen as a k-dimensional vector with coordinates being acoustical descriptors. Next, we partition the set of k-dimensional vectors into disjoint clusters, where each cluster keeps vectors that are similar with respect to a maximal number of coordinates. These coordinates are called active dimensions.
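As a small illustration of the clustering notion just described, the sketch below marks a cluster's active dimensions as those coordinates on which its vectors stay within a tolerance. The tolerance threshold and the exact similarity test are simplifying assumptions of mine, not the precise TV-tree criterion.

```python
import numpy as np

def active_dimensions(cluster: np.ndarray, tol: float = 0.1) -> np.ndarray:
    """Return indices of coordinates on which all vectors in the cluster agree
    to within `tol` (a simplified stand-in for TV-tree active dimensions)."""
    spread = cluster.max(axis=0) - cluster.min(axis=0)   # per-dimension range
    return np.where(spread <= tol)[0]

# Toy usage: 4 descriptor vectors (rows) with 3 acoustical dimensions (columns).
vectors = np.array([[0.50, 0.91, 0.10],
                    [0.52, 0.12, 0.11],
                    [0.49, 0.55, 0.09],
                    [0.51, 0.30, 0.12]])
print(active_dimensions(vectors))  # dimensions 0 and 2 are "active" here
```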
For instance, assume that my plan is to represent the set of N vectors as a TV-tree of order 2, which means that the construction of only 2 clusters per node is allowed. Firstly, I divide the set of N vectors into 2 clusters in such a way that the total number of active dimensions in both clusters is maximized. I repeat the same process for each cluster, again trying to maximize the total number of active dimensions in the corresponding subclusters. I continue this process until all subclusters are relatively dense (all vectors are close to each other with respect to all dimensions). The underlying structure for this method is a binary tree with nodes storing information about the center of the corresponding cluster, the smallest radius of a sphere containing this cluster, and its list of active dimensions. The TV-tree is a structure originating from textual databases, and it was modified by (Ras & Wieczorkowska, 2001, 2003) to model audio data.

The Proposed Thesis Research
With the increasing development of information systems containing multimedia information, especially Web-accessible databases, demand grows for tools for content-based browsing of multimedia files. Multimedia databases have become a distinguished domain of research, and various aspects of this domain

have been considered and investigated (Fingerhut, 1997), (Subrahmanian, 1998). Research on image and video content indexing is conducted and published broadly all over the world, but audio content description has not been deeply explored so far, with the exception of the speech domain (Foote, 1999). My research will be focused on content indexing in music.

Exemplary Scenario and Implementation Problems
Let us assume that we ask an information system handling musical data to find all pieces with a melody similar to the one sung by a user, preferably carried by a sax or another instrument he likes (or thinks would be pleasant to listen to). To begin, we should remember what was said earlier about audio files: musical data are basically stored in waveform or in MIDI form. As a consequence of this, automatic content extraction can generally be performed at a lower level of signal processing (identification of sound events) and at a higher level of musical structure extraction (themes and so on). In the case of MIDI data, we only deal with partially labeled musical information. We know what notes are played and what voices are there. However, we do not have access to high-level information, like key or bars. In the case of recordings, we can only read the amplitude of the audio wave in a given time instant for each channel. To perform a search in such a scenario, we must either process the data or have the data labeled with the necessary information. For instance, we can label the musical pieces (possibly via automatic extraction) with the main themes of the tune, metrical and key information, and so on. Information about the performer, title, recording, or the issue date can also be useful here, but this information is normally given by the producer of the recording. First of all, such a system must be tolerant to all possible imperfections of the input sung query (Adams, Bartsch, Wakefield, 2003): singing out of tune, notes or words missing or wrong, unstable and/or incorrect tempo, to mention the most common. Additionally, artistic interpretation can alter the score quite dramatically and complicate the comparisons even more. In the case of sung songs, one can use text-based tools for searching if the user remembers the lyrics, but can we really rely on the user's memory? When musical structure needs to be extracted, a number of new problems emerge: where a theme begins and ends, what is a theme, accompaniment, harmonic structure and so on. Related musical research is usually based on MIDI data to facilitate processing, since discovering information about the full musical score from frequency, "note on" and "note off" data only requires the use of knowledge of dependencies between musical events. Also, searching at the level of vague, subjective music description is quite a challenging task. To start with, a system that deals with very general queries, like "find me a nice piece of music", must be tailored to the individual preferences of the user, and requires dedicated research.

The Audio Data to Analyze
I am going to start my proposed research by collecting musical recordings in order to prepare a large set of data for processing, training and testing. The most ubiquitous sound formats present in the world's resources available for computer use are .wav and .mp3. The latter format is the most common on the Internet. However, it is a lossy format of sound compression, and audio data encoded using this standard are of lower quality.
Additionally, most sound files encoded in this format contain rock recordings, mostly with voices, percussion, keyboards producing synthesized sounds, and guitars. On one hand, such files contain processed audio data, with some audio elements removed and noise introduced in the case of a bigger quantization step. Although these data are not clean, they keep the most important audio information that can be a basis for content investigation. On the other hand, the main instruments of interest in such files are guitar and synthesizer (percussion playing the rhythmic background does not produce sounds of definite pitch). Guitar sounds are relatively easy to recognize because of the very specific envelope of the time-domain representation, i.e. fast attack and immediate release. Synthesizers can produce an enormously wide range of timbres, including artificial ones, and labelling such a timbre with a particular synthetic voice does not seem to be a realistic task, whereas labelling such sounds with the general label "synthesizer" should be relatively easy for sustained sounds of definite pitch. Therefore, I should concentrate my research on non-compressed audio files, and the .wav format seems to be a good choice because of its popularity. An important reason to deal with non-compressed audio files is to process a good quality audio signal. As a starting point, we should process high quality recordings, containing

singular sounds of musical instruments. This is to make sure that we are parameterizing actual features of the selected musical instrument sound, not accidental noise or features characterizing other instruments. After elaborating the technology for singular sounds, we can start working with simultaneous sounds of various instruments. While dealing with polyphonic sounds, we must be aware of some limitations of this research. When many sounds are produced simultaneously, it is not always possible to separate all of them and classify each timbre correctly, neither for a human listener nor for the recognition algorithm. When the sounds produced in a monophonic recording start and end at the same time instant, when there are many of them, and when the fundamental frequencies of these sounds are identical or harmonic, like in an A major chord played by orchestra tutti, perfect recognition of all sounds with respect to the instrument is unfeasible. The conductor, very experienced and trained through many years' experience, may realize during the performance which instruments are playing which notes, but he has additional spatial information to rely on. In the case of monophonic recordings, where for each time instant we have only one sample of amplitude available, there is no such spatial information. In order to make the elaborated algorithms applicable to a broad set of audio data, and also not to be limited to a specific number of channels, we have to rely on monophonic sound analysis and elaborate a sound representation that allows extraction of timbre information from the investigated data. When polyphonic sounds are investigated, we can still analyze their spectra, spectrograms (spectral evolution), and temporal evolution, or perform wavelet analysis, and separate pitches even for harmonic sounds, with limited accuracy (Popovic, Coifman and Berger, 1995), (Virtanen, 2003), (Cemgil, Kappen and Barber, 2003). Audio source separation techniques can also be used, like independent component analysis (ICA) or sparse decompositions (SD); these techniques basically originate from speech recognition in a cocktail party environment and can also be applied to source localization for auditory scene analysis, see for instance (Cardoso, Comon, 1996), (Vincent, Rodet, Röbel, Févotte, Carpentier, Gribonval, Benaroya, Bimbot, 2003). In the case of multichannel recordings, we can also use spatial cues to separate sound sources with good results (Viste and Evangelista, 2003). However, perfectly accurate identification of instruments playing exactly the same notes recorded in the same channel is unfeasible because of overlapping spectra. Moreover, layering the sounds of some instruments is one of the effects used in composing to produce another timbre, and spectral overlap is applied here on purpose. Even a trained listener may have difficulty identifying the instruments in such a case. To sum up, we must remember that in the case of polyphonic sounds, the recognition feasibility is limited. Therefore, for layered sounds with the same starting and ending time we can identify the most dominant instrument or instruments, but correct identification of all instruments is not possible. This is why we do not aim at the identification of all simultaneously playing instruments, but only the dominant one(s). In order to prepare a data set for further investigations, we should collect recordings of singular sounds of musical instruments, as well as concertos and recordings of chamber and symphony orchestra pieces.
Singular sounds are the starting point of the research, since they allow identification of sound features specific to each instrument. Next, concerti with one instrument dominating in the piece will be used for identification of a specific instrument in the presence of a musical background. Finally, orchestral pieces will be investigated to refine the elaborated parameterization and classification techniques. I will use public domain .wav files, but in order to produce a good set of data, recording audio CDs in .wav format will also be necessary. Singular sounds of musical instruments will be taken from the MUMS CDs (Opolko and Wapnick, 1987), since these sounds are already broadly used in musical instrument sound research and can be considered standard data. I assume working with monophonic data, although sources will probably be stereophonic in many cases. My goal is to elaborate a technique that can be applied to as many files as possible, and with the development of surround sound formats and the ubiquity of stereo files, selection of any non-singular number of channels would be a limitation to the application of the obtained results. Techniques elaborated for a single channel can be applied to any track recorded in a multi-channel audio file, and can be used as a basis for further development of this research, focused on spatial sound (with investigation of spatial information for improvement of sound identification).
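Since the analysis assumes monophonic data while many sources are stereophonic, multi-channel material has to be reduced to, or treated as, single channels. A minimal sketch of both options follows; the averaging downmix is one common convention and only an assumption here.

```python
import numpy as np

def to_mono(samples: np.ndarray) -> np.ndarray:
    """Downmix a (num_samples, num_channels) array to one channel by averaging."""
    if samples.ndim == 1:
        return samples                      # already monophonic
    return samples.mean(axis=1)

def per_channel(samples: np.ndarray):
    """Alternatively, yield each channel so single-channel techniques apply per track."""
    if samples.ndim == 1:
        yield samples
    else:
        for ch in range(samples.shape[1]):
            yield samples[:, ch]

# Toy usage with a fake 3-sample stereo signal.
stereo = np.array([[0.2, 0.4], [0.0, 0.2], [-0.1, 0.1]])
print(to_mono(stereo))                      # [0.3, 0.1, 0.0]
print([c.tolist() for c in per_channel(stereo)])
```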

Methodology for Sound Processing
To start any processing of digital musical sound, raw audio samples must be replaced by sound parameters. In order to correctly index sequences of samples with labels representing instruments or pitch, segments of audio data must be taken into consideration. Generally, segmentation can be performed in many ways. For instance, user-defined segments may contain the theme of the piece, a characteristic passage, and so on. In the case of automatic processing of unknown sound data, we do not have any hints where we should put the segments. Therefore, the solution is to analyze the whole file with a small analysis frame, sliding consecutively through the sound samples. The established practice is to use a sliding frame of constant size, with overlap, and a hop size of around half the frame length. I am going to follow this scheme, since it allows quite precise labelling of the audio data. An additional virtue of the sliding-frame technique is the possibility of tracing changes of sound features in time. The time-variance of sound features is specific to various instruments, for instance, fluctuation of pitch in vibrato, fast fading away (as in a plucked string), sudden onset and so on. Humans also trace changes of the sound in time to recognize an instrument. Especially the starting transient, i.e. the attack, when changes in the sound wave are dramatic, is very important to human listeners for correct recognition of the instrument. Therefore, the classification algorithm should also rely on tracking sound changes in time. The basic sound parameters extracted in the analysis frame will be based on MPEG-7 low-level audio descriptors, as mentioned earlier. I am going to use both time-domain and frequency-domain descriptors, in order to exploit any information available for the analyzed frame. These descriptors are suited for automatic extraction, and the standard representation will assure interoperability of this description.

Hierarchical Classification of Musical Instrument Sounds
The MPEG-7 descriptors extracted for consecutive analysis frames are treated as a starting point for further data processing. In order to trace the evolution of sound features, we plan to elaborate intermediate descriptors that will provide an internal representation of sound in my recognition system. These descriptors will characterize temporal patterns specific to particular instruments or instrument groups. The groups may represent the instrument family, or the articulation (playing technique) applied to the sounds. This is why my system will apply hierarchical classification of musical instrument sounds. The family groups will basically include aerophones and chordophones, according to the Hornbostel and Sachs classification (Hornbostel and Sachs, 1914). In the case of chordophones, we are going to focus on the bowed lutes family that includes violin, viola, cello and double bass. The investigated aerophones will include flutes, single-reed (clarinet, sax) and double-reed (oboe, bassoon) instruments, sometimes called woodwinds, and lip-vibrated brass instruments: trumpet, trombone, tuba, and French horn. The articulation applied will include vibrato, pizzicato, and muting. Hierarchical classification is also one of the means to facilitate correct recognition of musical instrument sounds. Also, classification at the family level obviously yielded better results, as reported in the research performed so far, for instance in (Martin and Kim, 1998) and (Wieczorkowska, 1999).
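A minimal sketch of the two-level idea, first the family and then the instrument within that family, is given below. The nearest-centroid rule and the toy feature vectors are illustrative placeholders for the real classifiers and MPEG-7 based descriptors to be compared later.

```python
import numpy as np

def nearest_label(x, centroids):
    """Return the label whose centroid is closest to feature vector x."""
    return min(centroids, key=lambda lbl: np.linalg.norm(x - centroids[lbl]))

# Toy centroids of descriptor vectors, grouped hierarchically by family.
family_centroids = {
    "chordophones": np.array([0.2, 0.8]),
    "aerophones":   np.array([0.7, 0.3]),
}
instrument_centroids = {
    "chordophones": {"violin": np.array([0.15, 0.85]), "cello":   np.array([0.25, 0.70])},
    "aerophones":   {"flute":  np.array([0.65, 0.35]), "trumpet": np.array([0.80, 0.25])},
}

def classify(x):
    family = nearest_label(x, family_centroids)                    # level 1: family
    instrument = nearest_label(x, instrument_centroids[family])    # level 2: instrument
    return family, instrument

print(classify(np.array([0.18, 0.82])))  # ('chordophones', 'violin')
```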
Another argument for hierarchical classification is that, for the user, information about the instrument family or articulation may be sufficient. For example, a non-expert user may just look for a brass-performed theme, or a melody played with sweet vibration, a delicate pizzicato motif and so on. Not to mention that some users simply may not be familiar with the sounds of all instruments, and they may not know what a particular instrument sounds like. Since we are going to deal with real recordings, the audio data may contain various kinds of sounds, including non-pitched percussive sounds, and in further development of the system, also singing or speech. Therefore, my system should start with classification of the type of signal (speech, music, pitched/non-pitched), using simple criteria like zero-crossing rate (Foote, 1999), and performing elements of auditory scene analysis and recognition (Peltonen, Tuomi, Klapuri, Huopaniemi and Sorsa, 2002), (Rosenthal and Okuno, 1998), (Wyse and Smoliar, 1998). Then, for pitched musical instrument sounds, the system will proceed with further specification, to get as much information as possible from the audio signal.

Knowledge Discovery Process

Methodology for Data Storage, Learning Process, Query Answering
The basic goal of this research is automatic segmentation of the audio signal with respect to musical instrument sounds. The decision part of the data will consist of three attributes: pitch, instrument and type. The feature attributes will incorporate intensity, fundamental frequency, spectral parameters and phase attributes. The segmentation starts with a sliding frame of analysis, which is the basis for producing the MPEG-7 low-level descriptors. Analysis of energy and spectral differences in time allows detecting the onsets of consecutive musical events, i.e. consecutive notes or chords (Foote and Uchihashi, 2001); such segments can later be joined, in hierarchical segmentation, into passages of the same instrument(s). Low-level descriptor values are calculated for consecutive, overlapping frames within a given musical event, thus creating time series. Next, these time series, i.e. the temporal changes of the low-level descriptors within each segment (musical event), are traced, and temporal patterns are discovered. At this stage, knowledge discovery methods will be applied.
(Figure: feature attributes and their perceptual correlates: intensity = loudness; fundamental frequency = pitch; spectral shape = timbre; phase difference in binaural hearing = location. Example decision attributes: Pitch 3C, 3C#, 3D, 3D#; Instrument: Violin; Type: Bowed.)
The process of analyzing data cannot be restricted to the construction of the classifier. In the case of musical instrument sound analysis, we have to extract the sound representation, i.e. choose the most appropriate set of descriptors and calculate the values of these descriptors for the given data. Therefore, it is advisable to consider the whole process in terms of a broader methodology. Knowledge Discovery is a process which consists of several important steps. The initial stages of this process consist in understanding the application domain and determining the goal of the investigation (already defined in my case). The next steps include creating or selecting a target data set and preprocessing, involving digital sound analysis in this research. This stage will require a huge effort in my case. Data transformation, and finally perhaps data reduction, using techniques like principal component analysis, pose another task to fulfill, which is essential when the representation is inconvenient (or of little use) to deal with, and the data are represented by an enormous number of descriptors. The proper selection of the feature set is crucial for the efficiency of the classification algorithm. In some cases, a set of descriptors is worth transforming into a more suitable form before it is used to model the data. For instance, before describing the data set by decision rules, one may transform descriptor values to gain higher support of rules, keeping their accuracy and increasing the generality of a model. Such a transformation can be necessary for various kinds of feature domains: numeric, symbolic, and so on. In my research, I am going to deal with numerical attributes only. After such processing of the data, the next steps of the knowledge discovery process involve selection of the data mining method, algorithms and their parameters. The constructed model is finally applied, and the results should be interpreted to sum up the whole discovery process applied to the investigated data. One of the main goals of data analysis is to construct models which properly classify objects into some predefined classes.
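To make the segmentation step concrete, the sketch below slides a fixed frame with half-frame hop over a signal and flags an onset wherever the positive frame-to-frame spectral difference jumps well above its average. The frame size, hop and threshold are illustrative assumptions, not values fixed by this proposal.

```python
import numpy as np

def onset_frames(signal: np.ndarray, frame: int = 1024, threshold: float = 2.0):
    """Return frame indices whose spectral flux exceeds `threshold` times the mean flux."""
    hop = frame // 2                                   # half-frame overlap
    window = np.hanning(frame)
    spectra = []
    for start in range(0, len(signal) - frame + 1, hop):
        spectra.append(np.abs(np.fft.rfft(signal[start:start + frame] * window)))
    spectra = np.array(spectra)
    flux = np.maximum(np.diff(spectra, axis=0), 0.0).sum(axis=1)  # positive spectral change
    return np.where(flux > threshold * (flux.mean() + 1e-12))[0] + 1

# Toy usage: silence followed by a 440 Hz tone; the onset shows up near the junction.
sr = 44100
tone = np.sin(2 * np.pi * 440 * np.arange(sr // 2) / sr)
signal = np.concatenate([np.zeros(sr // 2), tone])
print(onset_frames(signal))
```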
Reasoning with data can be stated as a classification problem, concerning prediction of a decision class based on information provided by some attributes. The classifier must learn to recognize musical instrument sounds. In order to realize this task, we need to prepare the training data in the form of a decision table S = (U, A ∪ {d}), where each element u ∈ U represents a sound sample, each element a ∈ A is a numeric feature (attribute) corresponding to one of the sound descriptors, and the decision attribute d ∉ A labels a particular sound object with an integer code corresponding to the instrument. An attribute a ∈ A can be put into the decision table S in various ways. The attributes can represent basic sound descriptors, their changes in time, their mutual

dependencies, etc. I assume starting with the set of low-level sound descriptors representing spectral or time-domain features of the analyzed frame of the sound sample. After the preprocessing, we can divide the audio data into segments that are homogeneous with respect to the fundamental frequency or frequencies of the content, and work with such segments. Let us suppose that a new attribute, added to the basic set, should approximate the time-variant behavior of some low-level sound feature within the segment. The curve of the descriptor's evolution can be represented as a fixed-length sequence of the descriptor's values, defining an approximation space. After introducing some distance measure in such a space of approximations, we can apply one of the basic clustering (grouping) methods to find the most representative curves that will distinguish musical instruments in audio recordings. Basic descriptors themselves can also be used to recognize musical instrument sounds, and there are descriptors that have already been used in such investigations. For instance, features of the steady state of the sound were applied, including brightness of sound, contents of selected groups of partials in the spectrum, and so on. On the other hand, features describing the time-domain shape of the sound, like the attack (starting transient) time, were also used. I propose a new approach, Temporal Abstraction Extraction, consisting in the construction of new attributes that describe the time-variance of all basic attributes, both time-domain and spectral-domain ones. Fundamentally, all attributes will be classified as temporal, abstraction and extraction attributes. The problem to solve, when constructing attributes describing the time behavior of basic descriptors, is the length of the sliding window of the basic analysis that produces these low-level descriptors. Usually, the length of the window is constant for the whole analysis, with the most common length being about 20-30 ms. For instance, Martin and Kim (1998) applied a 25 ms window, and Brown, Houix and McAdams (2001) used a 23 ms window. Such windows are sufficient for most sounds, since they contain at least a few periods of the recorded sounds. However, they are too short for analysis of the lowest musical sounds (of very long period), and too long for analysis of short pizzicato sounds, where changes are very fast, especially in the case of higher sounds. This is why we plan to experiment not only with a 25 ms window and half-frame overlap, but also with the length of the analysis frame set up as a multiple of the fundamental period of the lowest sound in the segment. At this stage of the research, we are going to use knowledge discovery methods again, in order to find the best length of the analysis frame, as well as the length of the overlap for the neighboring frames inside the segment. I will decompose each homogeneous segment into such frames and calculate value sequences for each of them. For each particular attribute, the obtained sequence of its values creates a time series, which can be further analyzed in various ways. For the set of all series (for all basic attributes) in the selected segment, we may look for temporal templates T of the following form: T = (P, ts, te), 1 ≤ ts ≤ te ≤ n, where P = {(a, V) : a ∈ B ⊆ A, V ⊆ Va, V ≠ ∅} and Va is the set of all possible values of an attribute a. P denotes the template and [ts, te] is the period of occurrence of this template.
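The sketch below shows one way a temporal template of the form (P, ts, te) could be represented and matched against a segment's discretized descriptor series. The dictionary encoding, the attribute names and the discretized values are hypothetical simplifications introduced only for illustration.

```python
# A segment as a sequence of discretized descriptor values per time frame
# (attribute -> value), and a temporal template as (P, ts, te) with P an
# attribute -> allowed-values mapping. All names and values are hypothetical.
segment = [
    {"brightness": "high", "attack": "fast"},   # frame 1
    {"brightness": "high", "attack": "slow"},   # frame 2
    {"brightness": "low",  "attack": "slow"},   # frame 3
]

def template_occurs(segment, P, ts, te):
    """True if every frame in [ts, te] (1-based) satisfies all descriptors in P."""
    return all(frame[a] in allowed
               for frame in segment[ts - 1:te]
               for a, allowed in P.items())

P = {"brightness": {"high"}}
print(template_occurs(segment, P, ts=1, te=2))  # True: frames 1-2 are bright
print(template_occurs(segment, P, ts=1, te=3))  # False: frame 3 is not
```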
In one segment we can find several temporal templates. They can be time dependent, i.e. one can occur before or after another. Thus, we can treat them as a sequence of events, consistent with the characteristics of the evolution of the sound descriptors in time. Depending on the needs, we can represent such sequences purely in terms of the temporal template occurrence in time, or focusing on the entire specifications of the templates. From sequences of templates we can discover frequent episodes, i.e. collections of templates occurring together (Slezak, Synak, Wieczorkowska, and Wroblewski, 2002). An episode occurs in a sequence of templates if each element (template) of the episode exists in the sequence and the order of occurrence is preserved. I expect

some of these episodes to be specific only to a particular instrument or group of instruments. I plan to construct new sound descriptors based on episodes. New descriptors can also be formed taking into account existing relations between sets of descriptors. Namely, we can construct a parameterized space of candidate descriptors based on available relations (for instance, SQL-like aggregations). Then, we can search that space for an optimal candidate descriptor by verifying which one seems to be the best for constructing decision models. This approach requires the application of knowledge discovery methods. Another very important application of knowledge discovery methods in this project is searching for possibly the most successful classification method. Classifiers used so far in the worldwide research were applied to various data sets, based on various sounds and parameterization techniques. Therefore, we would like to follow the classifiers performing best on such data: k-NN, the MPEG-7 chosen HMM, the recently developed SVM, and perhaps TDNN, and compare their performance on the same data. The classification should be resistant to the presence of background noise and accompanying music, and detect as many instruments in simultaneous sounds as possible. In the research performed so far on singular sounds of instruments, many classifiers have been applied (see the section on methods above). The simplest and at the same time one of the most successful classifiers is k-NN. So, we plan to start with an experiment on k-NN, searching for the best k and also for a metric necessary to calculate distances between data points.
(Figure: candidate sound features considered in this research. Harmonic features: fundamental frequency, fundamental frequency modulation, noisiness, inharmonicity, harmonic spectral deviation, odd-to-even harmonic ratio, harmonic tristimulus, and harmonic spectral shape (HarmonicSpectralCentroid, HarmonicSpectralSpread, HarmonicSpectralSkewness, HarmonicSpectralKurtosis, HarmonicSpectralSlope, HarmonicSpectralDecrease, HarmonicSpectralRollOff, HarmonicSpectralVariation). Perceptual features: loudness, relative specific loudness, sharpness, spread, perceptual spectral envelope shape (perceptual spectral centroid, spread, skewness, kurtosis, slope, decrease, rolloff, variation), odd-to-even band ratio, band spectral deviation, band tristimulus. Energy features: total energy, total energy modulation, total harmonic energy, total noise energy. Spectral features: spectral flatness, spectral crest, and spectral shape (spectral centroid, spread, skewness, kurtosis, slope, decrease, rolloff, variation). Temporal features: instantaneous temporal features (signal auto-correlation function, zero-crossing rate) and global temporal features (log attack time, temporal increase, temporal decrease, temporal centroid, effective duration). Various features: global spectral shape descriptors (MFCC, Delta MFCC, Delta Delta MFCC).)
Another parameter of the classifier worth exploring is the parameter specifying which points should be used as representatives of the classes (instruments): gravity centers of data clusters, the data points themselves, or other points. Investigation of many parameters is a common problem to deal with for any classifier.
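A minimal sketch of searching for the best k and distance metric with a hand-rolled k-NN and a simple cross-validation loop is given below. The candidate values of k, the Minkowski metrics and the synthetic two-class data are illustrative assumptions.

```python
import numpy as np

def knn_predict(train_x, train_y, x, k, p):
    """Classify x by majority vote among its k nearest neighbours (Minkowski-p distance)."""
    d = np.sum(np.abs(train_x - x) ** p, axis=1) ** (1.0 / p)
    votes = train_y[np.argsort(d)[:k]]
    return np.bincount(votes).argmax()

def cv_accuracy(x, y, k, p, folds=5):
    """Mean accuracy over simple contiguous folds."""
    idx = np.array_split(np.arange(len(x)), folds)
    accs = []
    for test in idx:
        train = np.setdiff1d(np.arange(len(x)), test)
        preds = [knn_predict(x[train], y[train], x[i], k, p) for i in test]
        accs.append(np.mean(preds == y[test]))
    return float(np.mean(accs))

rng = np.random.default_rng(0)
x = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(3, 1, (50, 4))])  # two toy "instruments"
y = np.repeat([0, 1], 50)
best = max(((k, p) for k in (1, 3, 5, 7) for p in (1, 2)),
           key=lambda kp: cv_accuracy(x, y, *kp))
print("best (k, metric p):", best)
```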
All classification methods are based on specific parameters which, when appropriately adjusted, may produce a successful classifier, and when poorly adjusted, may yield a classifier with a low accuracy rate for a given data set. Therefore, the classifiers to be applied in this proposed research must be tested with respect to their parameters. Also, TV-tree type structures for storage, classification and retrieval, proposed by (Wieczorkowska & Ras, 2001, 2003), will be developed further and tested simultaneously with the k-NN classifier. I plan to design a new TV-tree type data search engine with a built-in conceptual hierarchy for musical sounds. TV-tree type storage structures and the related search strategies have

been quite successfully used in the text-retrieval area (Subrahmanian, 1998). However, we are not going to limit ourselves to one classifier only. Decision trees, rough set based algorithms and possibly Bayes decision rules and neural networks (so far applied to small sets in such investigations) will also be investigated. Since various classifiers may perform better on specific subsets of data, we do not exclude the possibility of developing the final classifier as a combined one, based on the most successful classifiers.
(Figure: Implementing MIRAI. Components include EMO, an Emotion-based Music Ontology, together with music notation, MIDI, text, spectrum analysis, binary audio, instruments, scales and rhythms.)
Finally, we plan to develop a new adaptable TV-tree type data structure (query driven) which, jointly with FS-trees (Subrahmanian, 1998), will be used to store musical data. FS-trees are index-driven and they also model temporal data well. Definitions of new indexes, an outcome of the knowledge discovery process, will be used to partition some of the existing segments in the FS-tree. The query answering system for audio data will accept not only queries built from indexes, but also queries which contain some semantic type of information not well expressible in index-type languages. In this case, we will search the TV-tree for audio segments representing the closest match. The precision and recall of the query answering system depend on the accuracy of classification, which can be assessed in various ways. Standard procedures divide the whole investigated data set into a training and a testing part. The leave-one-out procedure is time-consuming and usually yields too optimistic results. The classifier is trained and tested here on almost the same data, and the obtained results cannot be considered a reliable forecast for new data. On the other hand, a 70/30 split of the data into 70% for training and 30% for testing does not require so many time-consuming learning phases. However, if the classifier is trained in only a few runs, it may give erroneously low results, since it can be under-trained. Therefore, in order to obtain reliable results, the issue of the accuracy testing procedure should be considered as well. Since we are going to deal with a large amount of data, we plan to apply an 80/20 split of the data, with test samples chosen with equal probability (weighted by the number of objects in particular classes) from all classes. This way we can avoid the situation when some classes are not represented during the learning phase and the resulting classifier is under-trained. After a classifier that recognizes various musical instruments is built, we should be able to point out which instrument is playing (or is dominating) in a given segment, and at what time moments this instrument starts and ends playing. Additionally, we plan to extract pitch information, which is one of the important factors in sound classification.
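The 80/20 split with class-proportional test sampling described above might look like the following sketch; rounding each class to roughly 20% and forcing at least one test sample per class are simplifying choices of mine.

```python
import numpy as np

def stratified_split(labels: np.ndarray, test_fraction: float = 0.2, seed: int = 0):
    """Return (train_idx, test_idx) with roughly `test_fraction` of each class in the test set."""
    rng = np.random.default_rng(seed)
    test = []
    for cls in np.unique(labels):
        members = np.where(labels == cls)[0]
        rng.shuffle(members)
        n_test = max(1, int(round(test_fraction * len(members))))  # every class appears in testing
        test.extend(members[:n_test])
    test = np.array(sorted(test))
    train = np.setdiff1d(np.arange(len(labels)), test)
    return train, test

# Toy usage: 3 instrument classes of different sizes.
labels = np.array([0] * 50 + [1] * 30 + [2] * 20)
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))   # 80 and 20 samples
```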

By combining melody and timbre information, we should be able to search successfully for favorite tunes played by favorite instruments. I expect that many users will benefit from the outcomes of this research, including researchers investigating neighboring domains, students, sound engineers, and ordinary users of audio and multimedia data. My results will be presented at national and international meetings such as MIR, KDD, PKDD, ICDM, DaWaK, or RIAO. Also, several journal papers describing the outcomes of this research will be published.

Proposed Thesis Schedule
1. Database tuples & decision attributes
   a. Collect audio data for the experiments in all available forms, i.e. .wav and other audio files, and CDs if necessary, and convert into .wav format [A. Wieczorkowska & MS student, PJIIT]; if needed, this initial collection of data can be later expanded.
2. Temporal Features attributes
   a. Extract low-level sound parameters for singular sounds of musical instruments and set up a corresponding database [A. Wieczorkowska, MS students]; parameter extraction can be implemented using a very limited test set, so this part of the research can be performed in parallel with collecting audio data.
   b. Design models of sound description, based on low-level parameters, for the internal representation of the audio data for the classifier. This needs to be completed before a formal theory of classification of musical sounds based on these parameters can be developed.
3. Training
   a. Elaborate a hierarchical model of musical instrument sound classification for pitched instruments of the contemporary orchestra [A. Wieczorkowska]; the classic Hornbostel and Sachs model will be adopted, with articulation explicitly incorporated into that model.
      - Develop a formal theory of classification for musical sounds, represented by the elaborated numerical parameters [Z. Ras & PhD student at UNC-Charlotte]; this theory will be based on the model of sound description elaborated earlier and it will give feedback on whether any adjustment to that model is needed.
      - Research on possible classifiers for audio data [PhD, MS students]; comparison of performance for various classifiers on the same data is necessary.
   b. Cross-Validation
      i. LOOCV
      ii. K-Fold
      iii. 10-Fold
4. Development of a theoretical model for musical signal segmentation [A. Wieczorkowska]; this stage is a basis for further experiments:
   a. Experiments on sound segmentation for the investigated data [MS student, PJIIT]; these experiments will give us feedback on the design of the segmentation model and will lead to model adjustment, if necessary. Preparing the segmented sound database, which will be labelled according to the hierarchical model [MS students]; this is a very time-consuming part of the project.
   b. Investigations on the best internal representation of sound for instrument identification purposes, with temporal patterns and relational descriptors [Post Doctoral Associate, PJIIT]; this part of the project should yield innovative tools for sound description and operation, which can also be applied to any sounds.
   c. Designing a TV-tree type structure jointly with an FS-tree as a new representation for the segmented sound database [Z. Ras & A. Wieczorkowska]; implementation [PhD Student, UNC-Charlotte]; this new structure will be a basis for the classification of sounds as well as for designing a query answering system; this stage must be completed before the end of the second year.

   d. Elaboration of alternative models for efficient sound representation and classification, including the k-NN classifier [Z. Ras, A. Wieczorkowska, Post Doctoral Associate at PJIIT, MS Student at UNC-Charlotte]; it has to be completed before the start of the third year of the project.
5. Designing the query answering system:
   a. This will be based on the FS-tree and TV-tree type representation of the musical data [Z. Ras & A. Wieczorkowska]; implementing and testing its precision and recall [PhD Student & MS Student, UNC-Charlotte]; this is the most important and time-consuming part of the project in the third year.
   b. Development and implementation of an alternative model for efficient representation and classification of musical sounds [A. Wieczorkowska, Post Doctoral Associate & MS Student at PJIIT]; it will be performed in parallel with the FS/TV-tree implementation.
      - Comparison of the two models in practical tests and the development of the final classifier with a Web-based interface [A. Wieczorkowska, Z. Ras, PhD student]; this is the final stage, concluding the research.
6. Testing

Low-level audio features
There are essentially two ways of describing low-level audio features. One may sample values at regular intervals, or one may use AudioSegments to demarcate regions of similarity and dissimilarity within the sound. Both of these possibilities are embodied in the low-level descriptor types, AudioLLDScalarType and AudioLLDVectorType. A descriptor of either of these types may be instantiated as sampled values in a ScalableSeries, or as a summary descriptor within an AudioSegment. The AudioSegment is a concept that permeates the MPEG-7 Audio standard. An AudioSegment is a temporal interval of audio material, which may range from arbitrarily short intervals to the entire audio portion of a media document. A required element of an AudioSegment is a MediaTime descriptor that denotes the beginning and end of the segment. The TemporalMask DS is a construct that allows one to specify a temporally non-contiguous AudioSegment. An AudioSegment (as with any SegmentType) may be decomposed hierarchically to describe a tree of Segments. Another key concept lies in the abstract datatypes AudioDType and AudioDSType.
Scalable series: These are datatypes for series of values (scalars or vectors). They allow the series to be scaled (downsampled) in a well-defined fashion. Two types are available: SeriesOfScalarType and SeriesOfVectorType. They are useful in particular to build descriptors that contain time series of values. Scaling specifies how the original samples are scaled. If absent, the original samples are described without scaling. The scale ratio is the number of original samples represented by each scaled sample; it is common to all elements in a sequence. The value to be used when Scaling is absent is 1. numofelements is the number of scaled elements in a sequence. The value to be used when Scaling is absent is equal to the value of totalnumofsamples. totalnumofsamples is the total number of samples of the original series (before scaling). It is of interest to note that the last sample of the series may summarize fewer than ratio samples. This happens if totalnumofsamples is smaller than the sum over runs of the product of numofelements by ratio.

This descriptor represents a series of scalars, at full resolution or scaled. Use this type within descriptor definitions to represent a series of feature values. If the scaling operations are used, they shall be computed as follows:

Min: $m_k = \min_{i=(k-1)N+1}^{kN} x_i$. If Weight is present, ignore samples with zero weight; if all have zero weight, set to zero by convention.

Max: $M_k = \max_{i=(k-1)N+1}^{kN} x_i$. If Weight is present, ignore samples with zero weight; if all have zero weight, set to zero by convention.

Mean: $\bar{x}_k = \frac{1}{N}\sum_{i=(k-1)N+1}^{kN} x_i$. If Weight is present, $\bar{x}_k = \sum_{i=(k-1)N+1}^{kN} w_i x_i \,/\, \sum_{i=(k-1)N+1}^{kN} w_i$; if all samples have zero weight, set to zero by convention.

Random: choose at random among the N samples. If Weight is present, choose at random with probabilities proportional to the weights; if all samples have zero weight, set to zero by convention.

First: choose the first of the N samples. If Weight is present, choose the first non-zero-weight sample; if all samples have zero weight, set to zero by convention.

Last: choose the last of the N samples. If Weight is present, choose the last non-zero-weight sample; if all samples have zero weight, set to zero by convention.

Variance: $z_k = \frac{1}{N}\sum_{i=(k-1)N+1}^{kN} (x_i - \bar{x}_k)^2$. If Weight is present, $z_k = \sum_{i=(k-1)N+1}^{kN} w_i (x_i - \bar{x}_k)^2 \,/\, \sum_{i=(k-1)N+1}^{kN} w_i$; if all samples have zero weight, set to zero by convention.

Weight: $\bar{w}_k = \frac{1}{N}\sum_{i=(k-1)N+1}^{kN} w_i$.

In these formulae, k is an index in the scaled series, and i an index in the original series. N is the number of samples summarized by each scaled sample. The formula for Variance differs from the standard formula for unbiased variance by the presence of N rather than N-1; unbiased variance is easy to derive from it. If the Weight field is present, the terms of all sums are weighted.

SeriesOfScalarBinaryType

MIRAI will use this type to instantiate a series of scalars with a uniform power-of-two ratio. The restriction to a power-of-two ratio eases the comparison of series with different ratios, as the decimation required for the comparison will also be a power of two; such decimation allows exact comparison. It also allows an additional scaling operation to be defined (scalewise variance). Considering these computational properties of power-of-two scale ratios, the SeriesOfScalarBinaryType is the most useful of the Scalable Series family.
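To make these scaling operations concrete, the following sketch (an illustrative helper of my own, not a normative MPEG-7 implementation) downsamples a scalar series by an integer ratio and returns the scaled Min, Max, Mean, Variance and Weight fields:

```python
import numpy as np

def scale_series(x, ratio, w=None):
    """Downsample a series by an integer scale ratio, following the
    scalable-series operations described above (min, max, mean,
    variance, weight). Illustrative sketch only; names are my own."""
    x = np.asarray(x, dtype=float)
    n = (len(x) // ratio) * ratio            # drop the incomplete tail for simplicity
    x = x[:n].reshape(-1, ratio)             # one row per scaled sample
    if w is None:
        mean = x.mean(axis=1)
        var = x.var(axis=1)                  # biased variance (divide by N), as specified
        weight = np.ones(len(x))             # no Weight field given; report unit weights
    else:
        w = np.asarray(w, dtype=float)[:n].reshape(-1, ratio)
        sw = w.sum(axis=1)
        safe = np.where(sw > 0, sw, 1.0)     # avoid division by zero
        mean = np.where(sw > 0, (w * x).sum(axis=1) / safe, 0.0)
        var = np.where(sw > 0, (w * (x - mean[:, None]) ** 2).sum(axis=1) / safe, 0.0)
        weight = w.mean(axis=1)
    # (For brevity, Min/Max here do not apply the ignore-zero-weight rule.)
    return {"Min": x.min(axis=1), "Max": x.max(axis=1),
            "Mean": mean, "Variance": var, "Weight": weight}

# Example: scale a 16-sample series by a power-of-two ratio of 4.
print(scale_series(np.arange(16), ratio=4))
```

With a power-of-two ratio such as 4, the result corresponds to the SeriesOfScalarBinaryType case discussed above.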

Note that the types SeriesOfScalarBinaryType and SeriesOfVectorBinaryType inherit from the appropriate non-binary type. This means that although they are not used directly in this document, they can be used in place of the non-binary type at any time.

Scalewise variance

The scalewise variance is a decomposition of the variance into a vector of coefficients that describe variability at different scales. The sum of these coefficients equals the variance. To calculate the scalewise variance of a set of $N = 2^m$ samples, first recursively form a binary tree of means:

$\bar{x}^{(1)}_k = (x_{2k-1} + x_{2k})/2, \quad k = 1,\dots,N/2$
$\bar{x}^{(2)}_k = (\bar{x}^{(1)}_{2k-1} + \bar{x}^{(1)}_{2k})/2, \quad k = 1,\dots,N/4$
$\dots$
$\bar{x}^{(m)}_1 = (\bar{x}^{(m-1)}_1 + \bar{x}^{(m-1)}_2)/2$

where $x$ is a sample. Then calculate the coefficients $z$:

$z_1 = (2/N)\sum_{k=1}^{N/2} (x_{2k} - \bar{x}^{(1)}_k)^2$
$z_2 = (4/N)\sum_{k=1}^{N/4} (\bar{x}^{(1)}_{2k} - \bar{x}^{(2)}_k)^2$
$\dots$
$z_m = (\bar{x}^{(m-1)}_2 - \bar{x}^{(m)}_1)^2$

The vector formed by these coefficients is the scalewise variance for this group of samples. The VarianceScalewise field stores a series of such vectors. (A numerical sketch of this recursion is given after the vector operations below.)

SeriesOfVectorType

This descriptor represents a series of vectors. Most of its operations are straightforward extensions of the operations previously defined for series of scalars, applied uniformly to each dimension of the vectors. Operations that are specific to vectors are defined here:

Covariance: $\sigma^{jj'}_k = \frac{1}{N}\sum_{i=(k-1)N+1}^{kN} (x^j_i - \bar{x}^j_k)(x^{j'}_i - \bar{x}^{j'}_k)$. If Weight is present, $\sigma^{jj'}_k = \sum_i w_i (x^j_i - \bar{x}^j_k)(x^{j'}_i - \bar{x}^{j'}_k) \,/\, \sum_i w_i$.

VarianceSummed: $z_k = \frac{1}{N}\sum_{i=(k-1)N+1}^{kN}\sum_{j=1}^{D} (x^j_i - \bar{x}^j_k)^2$. If Weight is present, $z_k = \sum_i w_i \sum_j (x^j_i - \bar{x}^j_k)^2 \,/\, \sum_i w_i$; if all samples have zero weight, set to zero by convention.

MaxSqDist: $\mathrm{MSD}_k = \max_{i=(k-1)N+1}^{kN} \lVert x_i - \bar{x}_k \rVert^2$. If Weight is present, ignore samples with zero weight; if all samples have zero weight, set to zero by convention.
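As flagged above, here is a small numerical sketch of the scalewise-variance recursion (the helper is illustrative and of my own naming, not the MPEG-7 reference software):

```python
import numpy as np

def scalewise_variance(x):
    """Decompose the biased variance of 2**m samples into one coefficient
    per scale by recursively averaging adjacent pairs (a Haar-style tree
    of means). Illustrative sketch only."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    assert n > 0 and (n & (n - 1)) == 0, "length must be a power of two"
    coeffs = []
    level = x
    while len(level) > 1:
        pairs = level.reshape(-1, 2)
        means = pairs.mean(axis=1)          # next level of the tree of means
        scale = n // len(level)             # 1, 2, 4, ... original samples per value
        coeffs.append(2 * scale / n * np.sum((pairs[:, 1] - means) ** 2))
        level = means
    return np.array(coeffs)

x = np.random.randn(16)
z = scalewise_variance(x)
print(np.allclose(z.sum(), x.var()))        # the coefficients sum to the biased variance
```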

In the vector-operation formulae above, k is an index in the scaled series, and i an index in the original series. N is the number of vectors summarized by each scaled vector. D is the size of each vector, j is an index into each vector, and $\bar{x}^j_k$ is the mean of the j-th component over the N summarized samples. The various variance/covariance options offer a choice of several cost/performance tradeoffs for the representation of variability.

SeriesOfVectorBinaryType

Used to instantiate a series of vectors with a uniform power-of-two ratio. The restriction to a power-of-two ratio eases the comparison of series with different ratios, as the decimation necessary for the comparison is simply another power of two. The use of power-of-two scale ratios is recommended.

Low-level Audio Descriptors

Low-level Audio Descriptors (LLDs) consist of a collection of simple, low-complexity descriptors that are designed to be used within the AudioSegment framework (see the MPEG-7 Audio standard, ISO/IEC 15938-4). Whilst being useful in themselves, they also provide examples of a design framework for future extension of the audio descriptors and description schemes. All low-level audio descriptors are defined as subtypes of either AudioLLDScalarType or AudioLLDVectorType, except the AudioSpectrumBasisType. There are two description strategies using these data types: single-valued summary and sampled-series segment description. These two description strategies are made available for the two data types, Scalar/SeriesOfScalarType and Vector/SeriesOfVectorType, and are implemented as a choice in DDL. When using summary descriptions (containing a single scalar or vector) there are no normative methods for calculating the single-valued summarization. However, when using series-based descriptions, the summarization values shall be calculated using the scaling methods provided by the SeriesOfScalarType and SeriesOfVectorType descriptors, such as the min, max and mean operators.

AudioLLDScalarType

Abstract definition inherited by all scalar-datatype audio descriptors. Scalar is the value of the descriptor. SeriesOfScalar holds scalar values for a sampled-series description of an audio segment; use of this scalable series datatype promotes compatibility between sampled descriptions. hopSize is the time interval between data samples for the series description. The default value is PT10N1000F, which is 10 milliseconds. Values other than the default shall be integer multiples or divisors of 10 milliseconds; this ensures compatibility of descriptors sampled at different rates. Audio descriptors that are calculated at regular intervals (sample period or frame period) shall use the hopSize field to specify the extraction period. In all cases, the hopSize shall be a positive integer multiple or divisor of the default 10-millisecond sampling period. Note that downsampling by means of the scalable series does not change the specified hopSize but instead specifies the downsampling scale ratio to be used together with the hopSize. AudioLLDScalarType and AudioLLDVectorType are both abstract and therefore never instantiated.

AudioWaveFormType

AudioWaveForm describes the audio waveform envelope, typically for display purposes. It allows economical display of an audio waveform: for example, a sound-editing application can display a summary of an entire audio file immediately without processing the audio data, and the data may be displayed and edited over a network. Whatever the number of samples, the waveform may be displayed using a small set of values that represent the extrema (min and max) of frames of samples; a minimal sketch of this extraction follows.
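The sketch below computes such per-frame extrema for a mono signal (illustrative only; the function name, parameters and framing policy are my own, not normative):

```python
import numpy as np

def waveform_envelope(samples, sr, hop_ms=10):
    """Per-frame (min, max) extrema of a mono signal, in the spirit of the
    AudioWaveform description above. Illustrative sketch only."""
    hop = int(sr * hop_ms / 1000)
    n_frames = len(samples) // hop
    frames = np.asarray(samples[:n_frames * hop], dtype=float).reshape(n_frames, hop)
    return frames.min(axis=1), frames.max(axis=1)

# Example: a 1-second 440 Hz tone at 44.1 kHz summarized into 100 (min, max) pairs.
sr = 44100
t = np.arange(sr) / sr
mins, maxs = waveform_envelope(np.sin(2 * np.pi * 440 * t), sr)
print(len(mins), float(mins.min()), float(maxs.max()))
```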
Min and max are stored as scalable time series within the AudioWaveform descriptor. They may also be used for fast comparison between waveforms.

AudioPowerType

19 AudioPower describes the temporally-smoothed instantaneous power (square of waveform values). Instantaneous power is calculated by taking the square of waveform samples. These are averaged over time intervals of length corresponding to hopsize and stored in the Mean field of a SeriesOfScalarType. Instantaneous power is a useful measure of the amplitude of a signal as a function of time, P(t)= s(t). In association with AudioSpectrumCentroid and AudioSpectrumSpread D, the AudioPower provides an economical description of the power spectrum (spreading the power over the spectral range specified by the centroid and spread) that can be compared with a log-frequency spectrum. Another possibility is to store instantaneous power at high temporal resolution, in association with a high spectral resolution power spectrum at low temporal resolution, to obtain a cheap representation of the power spectrum that combines both spectral and temporal resolution. Instantaneous power is coherent with the power spectrum. A signal labeled with the former can meaningfully be compared to a signal labeled with the latter. Note however that temporal smoothing operations are not quite the same, so values may differ slightly for identical signals. AUDIO SPECTRUM DESCRIPTORS AudioSpectrumAttributeGrp The AudioSpectrumAttributeGrp defines a common set of attributes applicable to many of the spectrum descriptions. AudioSpectrumEnvelopeType AudioSpectrumEnvelope describes the spectrum of the audio according to a logarithmic frequency scale. describes the short-term power spectrum of the audio waveform as a time series of spectra with a logarithmic frequency axis. It may be used to display a spectrogram, to synthesize a crude "auralization" of the data, or as a general-purpose descriptor for search and comparison. A logarithmic frequency axis is used to conciliate requirements of concision and descriptive power. Peripheral frequency analysis in the ear roughly follows a logarithmic axis The power spectrum is used because of its scaling properties (the power spectrum over an interval is equal to the sum of power spectra over subintervals). AudioSpectrumEnvelopeType is the description of the power spectrum of the audio signal. The spectrum consists of one coefficient representing power between 0Hz and loedge, a series of coefficients representing power in resolution sized bands, between loedge and hiedge, and a coefficient representing power beyond hiedge, in this order. The range between loedge and hiedge is divided into multiple bands. The resolution, in octaves, of the bands is specified by resolution. Except for when the octaveresolution is /8, both loedge and hiedge must be related to khz as described in the following equation: rm edge = KHz where r is the resolution in octaves, m Z (i.e., m an integer). For the case when resolution is "8 octave" the spectrum delivers a single coefficient representing within-band power, and two extra coefficients for below-band and above-band power. In this case the default values for loedge and hiedge should be used. If ml and mh are the integers corresponding to Equation 0. when edge equals loedge and hiedge, respectively, then the full set of band edges are given by edge = rm KHz, ml m mh. In every case the spectrum contains two extra values, one representing the energy between 0Hz and loedge, and the other energy between hiedge and half the sampling rate (See Error! Reference source not found.). If hiedge equals half the sampling rate then the second extra value is set to zero. 
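As a sketch of the band-edge arithmetic just described, the helper below enumerates edges of the form $2^{rm}$ kHz between loEdge and hiEdge (the quarter-octave resolution and the loEdge/hiEdge values are example choices matching the defaults quoted in the next paragraph; the function name is my own):

```python
import numpy as np

def spectrum_band_edges(lo_edge=62.5, hi_edge=16000.0, resolution=0.25):
    """Log-frequency band edges of the form edge = 2**(r*m) kHz between
    loEdge and hiEdge, as described above. The spectrum additionally
    carries two out-of-band values (0 Hz..loEdge and hiEdge..sr/2) that
    are not computed here. Illustrative sketch; defaults are example
    choices, not normative constants."""
    r = resolution
    m_lo = int(round(np.log2(lo_edge / 1000.0) / r))
    m_hi = int(round(np.log2(hi_edge / 1000.0) / r))
    return 1000.0 * 2.0 ** (r * np.arange(m_lo, m_hi + 1))

edges = spectrum_band_edges()
print(len(edges) - 1, "in-band coefficients;", edges[:3], "...", edges[-2:])
```

With these example values the range 62.5 Hz to 16 kHz spans 8 octaves, giving 32 quarter-octave bands plus the two out-of-band coefficients.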
These two extra values measure the "out-of-band" energy. The default hiEdge is 16000 Hz (corresponding to the upper limit of hearing). The default loEdge is 62.5 Hz (8 octaves below hiEdge). The default analysis frame period is 10 ms, which is within the range of estimates for

20 temporal acuity of the ear (8 to 3 ms) and is also the default analysis frame period for sampled audio descriptors. To extract the AudioSpectrumEnvelope the following method is recommended. The method involves a sliding window FFT analysis, with a resampling to logarithmic spaced bands. Determine the required hop length h, corresponding to the hopsize. If the sampling rate is sr, then h = sr*hopsize (e.g. h = 6000*0.0 = 60 samples). If sr*hopsize is not a whole number of samples then generate a vector h such that mean(h) = sr*hopsize (e.g. sr*hopsize = 050 * 0.0 = 0.5, h = [0 ]). By cycling through the vector of hop lengths the analysis will not stray over time, but will give minor jitter from the defined hopsize. This enables reasonable comparison of data sampled at differing rates. Determine the analysis window length lw. The analysis window has been chosen to have a default value of 3 hopsizes, 30ms. This is to provide enough spectral resolution to roughly resolve the 6.5 Hz-wide first channel of a octave resolution spectrum. Determine the FFT size, NFFT. NFFT is the next-larger power-of-two number of samples from lw, e.g. If lw = 33 samples then NFFT would be 048. Perform a STFT using a Hamming window of length lw, a shift of h samples (where h is a vector, rotate through the vector to prevent stray, and deliver minimal jitter), using a NFFT point FFT, with out-of-window samples set to 0. The descriptor only retains the square magnitude of the FFT coefficients, X w (k). The sum of the power spectrum coefficients is equal to the average power in the analysis window, P w. By Parseval s theorem there is a further factor of /NFFT to equate the sum of the squared magnitude of the FFT coefficients to the sum of the squared, zero-padded, windowed signal. P w = lw lw NFFT xw( n) = n= 0 lw NFFT k= 0 X w ( k) where x w ( n) = s( n) * w( n), 0 n < lw and w(n) is the Hamming window of length lw. Hence Px ( k) = X w( k) lw NFFT Since the audio signal is a real signal its Fourier transform has even symmetry. Hence only the spectral coefficients up to the Nyquist frequency need be retained. Resample to a logarithmic scale. Let DF be the frequency spacing of the FFT (DF = sr/nfft). An FFT coefficient more than DF/ from a band edge is assigned to the band. A coefficient less than DF/ from a band edge is proportionally shared between bands, as illustrated in Error! Reference source not found.. Important Note: Due to the weighting method illustrated in Error! Reference source not found. it is important to select an appropriate loedge at fine frequency resolutions. To be able to resolve the logarithmic bands there needs to be at least one FFT coefficient in each band. In some cases this means that the default loedge is unsuitable. Error! Reference source not found. indicates the minimum value that loedge should be set to for some popular sampling frequencies, assuming default hopsize. AudioSpectrumCentroidType AudioSpectrumCentroid describes the center of gravity of the log-frequency power spectrum. The SpectrumCentroid is defined as the power weighted log-frequency centroid. To be coherent with other descriptors, in particular AudioSpectrumEnvelope D, the spectrum centroid is defined Page 9

21 as the center-of-gravity of a log-frequency power spectrum. This definition is adjusted in the extraction to take into account the fact that a non-zero DC component creates a singularity, and eventual very-low frequency components (possibly spurious) have a disproportionate weight. Spectrum centroid is an economical description of the shape of the power spectrum. It indicates whether the power spectrum is dominated by low or high frequencies and, additionally, it is correlated with a major perceptual dimension of timbre; i.e.sharpness. To extract the spectrum centroid, calculate the power spectrum coefficients, as described in AudioSpectrumEnvelope extraction parts a-d. There are many different ways to design a spectrum centroid, according to the scale used for the values (amplitude, power, log power, cubic root power, etc.) and frequencies (linear or logarithmic scale) of spectrum coefficients. Perceptual weighting and masking can also be taken into account in more sophisticated measures. This particular design of AudioSpectrumCentroid was chosen to be coherent with other descriptors, in particular AudioSpectrumEnvelope D, so that a signal labeled with the former can reasonably be compared to a signal labeled with the latter. NFFT P x ( k), k = 0,.., Power spectrum coefficients below 6.5 Hz are replaced by a single coefficient, with power equal to their sum and a nominal frequency of 3.5 Hz. bound = P (0) = x 6.5 NFFT floor sr P ( k), f (0) = 3.5 P ( n) = P ( n + bound), f ( n) = ( n + bound) x bound k= 0 x x sr NFFT NFFT where n =,.., bound Frequencies of all coefficients are scaled to an octave scale anchored at khz. The spectrum centroid is calculated as: C = log ( f ( n) /000) P x ( n) Px ( n) n n AudioSpectrumSpreadType AudioSpectrumSpread Describes the second moment of the log-frequency power spectrum. To be coherent with other descriptors, in particular AudioSpectrumEnvelope D, the spectrum spread is defined as the RMS deviation of the log-frequency power spectrum with respect to its center of gravity. Details are similar to AudioSpectrumCentroid To extract the spectrum spread: Calculate the power spectrum, P x (n), and corresponding frequencies, f (n), of the waveform as for AudioSpectrumCentroid extraction, parts a-b. Calculate the spectrum centroid, C, as described in AudioSpectrumCentroid extraction part Calculate the spectrum spread, S, as the RMS deviation with respect to the centroid, on an octave scale: S = ((log ( f ( n) /000) C) Px ( n)) Px ( n) n n Spectrum spread is an economical descriptor of the shape of the power spectrum that indicates whether it is concentrated in the vicinity of its centroid, or else spread out over the spectrum. It allows differentiating between tone-like and noise-like sounds. Page 0
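The centroid and spread computations can be sketched together as follows (illustrative only; for brevity this version omits the replacement of the lowest-frequency coefficients by a single 31.25 Hz bin described above):

```python
import numpy as np

def spectrum_centroid_spread(power, freqs):
    """Centroid and RMS spread of a power spectrum on an octave scale
    anchored at 1 kHz, in the spirit of AudioSpectrumCentroid and
    AudioSpectrumSpread as defined above. Illustrative sketch only."""
    power = np.asarray(power, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    keep = freqs > 0                              # log2 is undefined at DC
    p, f = power[keep], freqs[keep]
    octaves = np.log2(f / 1000.0)
    centroid = np.sum(octaves * p) / np.sum(p)
    spread = np.sqrt(np.sum((octaves - centroid) ** 2 * p) / np.sum(p))
    return centroid, spread

# Toy example: power concentrated around 2 kHz gives a centroid near +1 octave.
freqs = np.linspace(0, 8000, 257)
power = np.exp(-0.5 * ((freqs - 2000) / 200) ** 2)
print(spectrum_centroid_spread(power, freqs))
```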

22 AudioSpectrumFlatnessType AudioSpectrumFlatness Describes the flatness properties of the spectrum of an audio signal within a given number of frequency bands. The AudioSpectrumFlatnessType describes the flatness properties of the short-term power spectrum of an audio signal. This descriptor expresses the deviation of the signal s power spectrum over frequency from a flat shape (corresponding to a noiselike or an impulse-like signal). A high deviation from a flat shape may indicate the presence of tonal components. The spectral flatness analysis is calculated for a number of frequency bands. It may be used as a feature vector for robust matching between pairs of audio signals. The extraction of the AudioSpectrumFlatnessType can be efficiently combined with the extraction of the AudioSpectrumEnvelopeType and is done in several steps: A spectral analysis (windowing, DFT) of the input signal is performed using the same procedure and parameters specified for the extraction of the AudioSpectrumEnvelopeType part a-d, but with the window length, lw, corresponding to hop size (i.e. no overlap between subsequent calculations). Hence hopsize = 30ms is recommended for this descriptor. A frequency range from loedge to hiedge is covered. Both limits must be chosen in quarter octave relation to khz as described in the following equation, i.e. edge = 0. 5m KHz where m Z (i.e., m an integer). In view of the limitations in available frequency resolution, use of AudioSpectrumEnvelopeType below 50 Hz is not recommended. A logarithmic frequency resolution of a /4 octave is used for all bands. Thus, all AudioSpectrumFlatnessType bands are commensurate with the frequency bands employed by AudioSpectrumEnvelopeType. In order to reduce the sensitivity against deviations in sampling frequency, the bands are defined in an overlapping fashion: For the calculation of the actual edge frequencies, the nominal lower edge and higher edge frequencies of each band are multiplied by the factors 0.95 and.05, respectively. Consequently, each band overlaps with its neighbor band by 0%. This results in band edges fb as described in Error! Reference source not found. (assuming the default loedge value of 50 Hz). The band edge frequencies are transformed to indices of power spectrum coefficients as follows: If DF is the frequency spacing of the DFT (DF = sampling rate / DFT size), the lower and higher edge of band b are defined by their power spectrum coefficient indices, il(b) and ih(b), respectively, which are derived from the edge frequencies by nint(f b / DF), where nint() denotes rounding to the nearest integer. For each frequency band, the flatness measure is defined as the ratio of the geometric and the arithmetic mean of the power spectrum coefficients (i.e. squared absolute DFT value, incl. grouping if required) c(i) within the band b (i.e. from coefficient index il to coefficient index ih, inclusive). SFM b ih( b) il( b) + ih( b) i= il( b) = ih( b) ih( b) il( b) + c( i) i= il( b) c( i) If no audio signal is present (i.e. the mean power is zero), a flatness measure value of is returned. In order to reduce the computational effort and adapt the frequency resolution to log bands, all power spectrum coefficients in bands above the edge frequency of khz are grouped, i.e. the above calculation is carried out using the average values over a group of power spectral coefficients rather than the single coefficients themselves. 
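Ignoring the 5% band overlap and the coefficient grouping that is defined next, the per-band flatness ratio can be sketched as follows (the function and the example band edges are hypothetical, not normative):

```python
import numpy as np

def spectral_flatness(power, freqs, band_edges):
    """Per-band spectral flatness: ratio of the geometric to the arithmetic
    mean of the power-spectrum coefficients in each band, as defined above.
    Illustrative sketch; overlap and grouping are omitted."""
    power = np.asarray(power, dtype=float)
    freqs = np.asarray(freqs, dtype=float)
    sfm = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        c = power[(freqs >= lo) & (freqs < hi)]
        if len(c) == 0 or np.mean(c) == 0:
            sfm.append(1.0)                      # "no signal" convention from the text
        else:
            geo = np.exp(np.mean(np.log(np.maximum(c, 1e-30))))
            sfm.append(geo / np.mean(c))
    return np.array(sfm)

# A flat (noise-like) spectrum yields values near 1; tonal bands fall well below 1.
freqs = np.linspace(0, 11025, 513)
noise_like = np.ones_like(freqs)
print(spectral_flatness(noise_like, freqs, [250, 500, 1000, 2000]))
```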
The grouping is defined in the following way: For all bands between nominal khz and khz, a grouping of consecutive power spectrum coefficients is used. For all bands between nominal khz and 4kHz, a grouping of 4 consecutive power spectrum coefficients is used. For all bands between nominal 4kHz and 8kHz, a grouping of 8 consecutive power spectrum coefficients is Page

23 used and so on. For the last group of coefficients in each band, the following rule is applied: If at least 50% of the required coefficients for the group are available in that band, this last group is included using the necessary amount of additional coefficients from the successive band. Otherwise this group is not included, and the number of coefficients used from the particular band is reduced accordingly. If the signal available to the extraction process does not supply proper signal content beyond a certain frequency limit (e.g. due to the signal sampling rate or other bandwidth limitations), no flatness values should be extracted for bands extending beyond this frequency limit. Instead, hiedge should be reduced accordingly to signal the number of bands available with proper flatness data. AudioSpectrumBasisType The AudioSpectrumBasis Contains basis functions that are used to project high-dimensional spectrum descriptions into a low-dimensional representation. Spectrum dimensionality reduction plays a substantial role in automatic classification applications by compactly representing salient statistical information about audio segments. These features have been shown to perform well for automatic classification and retrieval applications. Statistical basis functions of a spectrogram used for dimension reduction and summarization. Basis functions are stored in the Raw field of a SeriesOfVector, the dimensions of the series depend upon the usage model: For stationary basis components the dimension attribute is set to dim= N K where N is the spectrum length and K is the number of basis functions. For time-varying basis components dim= M N K where M is the number of blocks within the segment, N is the spectrum length and K is the number of basis functions per block. Block lengths must be at least K frames for K basis functions; default hopsize is PT500N000F. To extract a reduced-dimension basis from an AudioSpectrumEnvelope spectrum the following steps shall be executed: Power spectrum: instantiate an AudioSpectrumEnvelope descriptor using the extraction method defined in the AudioSpectrumEnvelope The resulting data will be a SeriesOfVectors with M frames and N frequency bins. Log-scale norming: for each spectral vector, x, in AudioSpectrumEnvelope, convert the power spectrum to a decibel scale: χ = 0log0( x t ) and compute the L-norm of the resulting vector: r = N χ k k = the new unit-norm spectral vector is given by: ~ χ x = r Observation matrix: place each normalized spectral frame row-wise into a matrix. The size of the resulting matrix is M x N where M is the number of time frames and N is the number of frequency bins. The matrix will have the following structure: ~ x ~ x ~ X = M M ~ xm T T T Page

24 Basis extraction: Extract a basis using a singular value decomposition (SVD), commonly implemented as a built-in function in many mathematical software packages using the command [U,S,V] = SVD(X,0). Use the economy SVD when available since the row-basis functions are not required and this will increase extraction efficiency. The SVD factors the matrix from step (c) in the following way: ~ T X = USV where X is factored into the matrix product of three matrices; the row basis U, the diagonal singular value matrix S and the transposed column basis functions V. Reduce the spectral (column) basis by retaining only the first k basis functions, i.e. the first k columns of V: V = [ v v L ] K v k k is typically in the range of 3-0 basis functions for sound classification and spectrum summarization applications. To calculate the proportion of information retained for k basis functions use the singular values contained in matrix S: I ( k) k i= = N where I(k) is the proportion of information retained for k basis functions and N is the total number of basis functions which is also equal to the number of spectral bins. The SVD basis functions are stored using a SeriesOfVector in the AudioSpectrumBasis Statistically independent basis (Optional): after extracting the reduced SVD basis, V, a further step consisting of basis rotation to directions of maximal statistical independence is required for some applications. This is necessary for any application requiring maximum separation of features; for example, separation of source components of a spectrogram. A statistically independent basis is derived using an additional step of independent component analysis (ICA) after SVD extraction. The ICA basis is the same size as the SVD basis and is placed in the same SeriesOfVector field as the SVD basis. j= S S ii jj Audio Window Spectrum Envelope Spectrum Normalization db Scale X ~ Extraction: SVD / ICA Vk X ~ Stored Basis Functions L Norm r = N z k k= r Basis Projection Features ~ ~ Y = X k V k Time varying components (Optional): the extraction process (a)-(e) outlined above can be segmented into blocks over an AudioSegment thus providing a time-varying basis. To do this, the basis is sampled at regular intervals, default 500ms (hopsize = PT500N000F), and a three-dimensional SeriesOfVector matrix results. The first dimension is the block index, the second is the spectral Page 3

25 dimension and the third gives the number of basis vectors. This representation can track basis functions belonging to sources in an auditory scene. AudioSpectrumProjectionType The AudioSpectrumProjection is the compliment to the AudioSpectrumBasis and is used to represent low-dimensional features of a spectrum after projection against a reduced rank basis. These two types are always used together. The low-dimensional features of the AudioSpectrumProjection consist of a SeriesOfVectors, one vector for each frame, t, of the normalized input spectrogram, x~ t. Each spectral frame from steps (a)-(c) above yields a corresponding projected vector, y t, that is stored in the SeriesOfVector AudioSpectrumProjectionType is a low-dimensional representation of a spectrum using projection against spectral basis functions. The projected data is stored in a SeriesOfVector The dimensions of the SeriesOfVector depend upon the usage model: For stationary basis components the dimension attribute is set to dim= N K+ where N is the spectrum length and K is the number of basis functions. For time-varying basis components dim= M N K+ where M is the number of blocks, N is the spectrum length and K is the number of basis functions per block. The elements of each AudioSpectrumProjection vector shall represent, in order, the L-norm value, r t, obtained in step (b) of AudioSpectrumBasis extraction. This shall be followed by the inner product of the normalized spectral frame, x~ t, from step (b) above and each of the basis vectors, v k, from step (d) or (e) above. The resulting vector has k+ elements, where k is the number of basis components, and it is defined by: [ ~ T T T x v ~ x v x ] y t r ~ v = t t t L t k. The AudioSpectrumBasis and AudioSpectrumProjection are used in the Sound Classification and Indexing Tools for automatic classification of audio segments using probabilistic models. In this application, basis functions are computed for the set of training examples and are stored along with a probabilistic model of the training sounds. Using these methods, audio segments can be automatically classified into categories such as speech, music and sound effects. Another example is automatic classification of music genres such as Salsa, HipHop, Reggae or Classical. For more information on automatic classification and retrieval of audio see the SoundClassificationModel DS below. The spectrum basis descriptors can be used to view independent subspaces of a spectrogram; for example, we may wish to view suspaces that contain independent source sounds in a mixture. To extract independent spectrogram subspaces for an audio segment, first perform extraction for AudioSpectrumBasis. Then the AudioSpectrumProjection is extracted as defined above. Reconstruction of an independent T spectrogram frame, x t, is calculated by taking the outer product of the jth vector in AudioSpectrumBasis and the j+th vector in AudioSpectrumProjection and multiplying by the normalization coefficient r: T x = r y j + v j t t t [ ] [ ] + where the + operator indicates the pseudo-inverse. These frames are concatenated to form a new spectrogram. Any combination of spectrogram subspaces can be summed to obtain either individual source spectrograms or an approximation of the original spectrogram. The salient features of a spectrogram may be efficiently represented with much less data than a full spectrogram using independent component basis functions. The following example is taken from a recording of a song Page 4

26 featuring guitar, drums, hi-hat, bass guitar, and organ. Error! Reference source not found. shows the original full-bandwidth spectrogram and Error! Reference source not found. shows a 0-component reconstruction of the same spectrogram. The data ratio, R, between the reduced-dimension spectrum and the full-bandwidth spectrum is: R = K + M N where K is the number of basis components, M is the number of frames in the spectrogram and N is the number of frequency bins. For example, a 5-component summary of 500 frames of a 64-bin spectrogram leads to a data reduction of ~: AudioSpectrumEnvelope description of a pop song. The required data storage is NM values where N is the number of spectrum bins and M is the number of time points Page 5
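To make the basis-extraction and projection pipeline described above concrete, here is a compact sketch of my own (it follows the dB-scaling, L2-normalization, SVD and projection steps, but skips the optional ICA rotation):

```python
import numpy as np

def spectrum_basis_and_projection(spectrogram, k=5):
    """Reduced-rank basis and projection in the spirit of
    AudioSpectrumBasis / AudioSpectrumProjection as described above.
    Illustrative sketch only; not the normative extraction."""
    X = 10.0 * np.log10(np.maximum(np.asarray(spectrogram, dtype=float), 1e-12))
    norms = np.linalg.norm(X, axis=1, keepdims=True)      # r_t for each frame
    X_tilde = X / np.maximum(norms, 1e-12)                # unit-norm spectral frames
    # Economy SVD: X_tilde = U S V^T; the basis is the first k columns of V.
    _, S, Vt = np.linalg.svd(X_tilde, full_matrices=False)
    V_k = Vt[:k].T                                        # N x k basis functions
    retained = np.sum(S[:k]) / np.sum(S)                  # proportion of information I(k)
    Y = np.hstack([norms, X_tilde @ V_k])                 # M x (k+1) projection [r_t, x~.v]
    return V_k, Y, retained

# Toy usage: 100 frames of a 64-bin power spectrogram.
spec = np.abs(np.random.randn(100, 64)) ** 2
V_k, Y, retained = spectrum_basis_and_projection(spec, k=5)
print(V_k.shape, Y.shape, round(float(retained), 3))
```

Reconstruction of a subspace spectrogram then follows the outer-product formula given above, multiplying each projected vector by its stored norm.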

27 AudioSpectrumProjection AudioSpectrumBasis Basis component reconstruction showing most of the detail of the original spectrogram including guitar, bass guitar, hi-hat and organ notes. The left vectors are an AudioSpectrumBasis and the top vectors are the corresponding AudioSpectrumProjection The two vectors are combined using the reconstruction equation given above. The required data storage is 0(M+N) values AudioFundamentalFrequencyType AudioFundamentalFrequency describes the fundamental frequency of the audio signal. The limits of the search range shall be specified using lolimit and hilimit. The extraction method shall report a fundamental frequency for any signal that is periodic over the analysis interval with a fundamental within the search range. The extraction method shall provide a confidence measure, between 0 and, to be used as a weight in scaling operations. Values of the estimate for which the weight is zero shall be considered non-periodic and ignored in similarity and scaling operations. The handling of non-zero values, that allow periodic values to be differentially weighted, is left up to the specific application. One extraction method is detailed in the extraction of the AudioHarmonicity This is not the best method available but it gives reasonable estimates of the fundamental frequency in stationary signals. Fundamental frequency is a good predictor of musical pitch and speech intonation. As such it is an important descriptor of an audio signal. This descriptor is not designed to be a descriptor of melody, but it may nevertheless be possible to make meaningful comparisons between data labeled with a melody descriptor, and data labeled with fundamental frequency. Fundamental frequency is complementary to the log-frequency logarithmic spectrum, in that, together with the AudioHarmonicity D, it specifies aspects of the detailed harmonic structure of periodic sounds that the logarithmic spectrum cannot represent for lack of resolution. The inclusion of a confidence measure, using the Weight field of the SeriesOfScalarType is an important part of the design, that allows proper handling and scaling of portions of signal that lack clear periodicity. Page 6
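Before turning to AudioHarmonicity, whose normalized cross-correlation search the fundamental-frequency extraction above refers to, here is an illustrative sketch of that idea (the search limits and names are my own, not normative):

```python
import numpy as np

def f0_and_harmonic_ratio(frame, sr, f_min=50.0, f_max=2000.0):
    """Fundamental-frequency and harmonic-ratio estimate for one frame by
    normalized autocorrelation, in the spirit of the method referred to
    above. Illustrative sketch only."""
    x = np.asarray(frame, dtype=float)
    k_min = int(sr / f_max)
    k_max = min(int(sr / f_min), len(x) - 1)
    best_k, best_r = 0, 0.0
    for k in range(k_min, k_max + 1):
        a, b = x[:-k], x[k:]
        denom = np.sqrt(np.dot(a, a) * np.dot(b, b))
        r = np.dot(a, b) / denom if denom > 0 else 0.0
        if r > best_r:
            best_k, best_r = k, r
    f0 = sr / best_k if best_k else 0.0
    return f0, best_r            # ratio near 1 for periodic signals, near 0 for noise

sr = 16000
t = np.arange(int(0.03 * sr)) / sr            # one 30 ms frame
f0, hr = f0_and_harmonic_ratio(np.sin(2 * np.pi * 220 * t), sr)
print(round(f0, 1), round(hr, 2))             # close to 220 Hz with ratio near 1
```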

28 AudioHarmonicityType AudioHarmonicity describes the degree of harmonicity of an audio signal. AudioHarmonicity contains two measures: HarmonicRatio, and UpperLimitOfHarmonicity. HarmonicRatio is loosely defined as the proportion of harmonic components within the power spectrum. It is derived from the correlation between the signal and a lagged representation of the signal, lagged by the fundamental period of the signal. In order to avoid dependency on the actual fundamental frequency estimate, the algorithm produces its own estimate by searching for the maximum value in the normalized cross-correlation of the signal. The algorithm is: Calculate r( i, k), the normalised cross correlation of frame i with lag k : s is the audio signal r ( i, k) m = i * n, where i = 0, M = frame index, M m+ n m+ n m = ( ) ( ) ( ) * + n s j s j k s j s j= m j= m j= m n = t * sr, where t = analysis window size (default = number of frames 0ms) and ( j k) k =, K = lag, where K = ω * sr, ω = maximum fundamental period expected (default 40ms) 0.5 sr = sampling rate The Harmonic Ratio H (i) is chosen as the maximum r( i, k) in each frame,i : H ( i) = max r( i, k) k=, n This value is for a purely periodic signal, and it will be close to 0 for white noise. The estimate can be refined by replacing each local maximum of r( i, k) by the maximum of a 3-point parabolic fit centered upon it. The UpperLimitOfHarmonicity is loosely defined as the frequency beyond which the spectrum cannot be considered harmonic. It is calculated based on the power spectra of the original and a comb-filtered signal. The algorithm is: Determine the combed signal c ( j) = s( j) λs( j K), j = m, ( m + n ) where λ = m+ n j= m m+ n s ( j) s( j K) s ( j K) is the optimal gain j= m K is the lag corresponding to the maximum cross correlation ( H ( i) = r( i, K) ), and the fundamental period estimate. If K is fractional, s(j-k) is calculated by linear interpolation. Calculate the DFTs of the signal, s(j), and the comb-filtered signal, c(j), using the technique described in AudioSpectrumEnvelope Calculate power spectra, and group the components below 6.5 Hz as explained for f AudioSpectrumCentroid For each frequency, lim, calculate the sum of power beyond that frequency, for both the original and comb-filtered signal, and take their ratio Page 7

29 f f a ( f lim ) = p' ( f ) max f = f lim max f = f lim p( f ) where p(f) and p'(f) are the power spectra of the unfiltered and filtered signal respectively, and the maximum frequency of the DFT. Starting from f u lim f max f lim = f max and moving down in frequency, find the greatest frequency,, for which this ratio is smaller than a threshold (Threshold = 0.5). Convert this value to an octave scale based on khz UpperLimitOfHarmonicity = log ( f /000) u lim A harmonicity measure allows distinguishing between sounds that have a harmonic spectrum (musical sounds, voiced speech, etc.) and those that have a non-harmonic spectrum (noise, unvoiced speech, dense mixtures of instruments, etc.). Together with the AudioFundamentalFrequency D, AudioHarmonicity describes the harmonic structure of sound. These features are orthogonal and complementary to a descriptor such as AudioSpectrumEnvelope The exact definitions of the measures (HarmonicRatio and UpperLimitOfHarmonicity) are designed to be easy to extract, and coherent with the definitions of other descriptors (most of which are based on power). is TIMBRE DESCRIPTORS The Tibre descriptors are distinct from the preceding low-level descriptors. In that are intended to be descriptors that apply to an entire audio segment, rather than being primarily sampled types. However it is possible to retain the instantaneous sampled series for a number of the descriptors, using the SeriesOfScalar of the AudioLLDScalarType. SeriesOfScalar may not be chosen for the LogAttackTime D, the SpectralCentroid and the TemporalCentroid D, as these descriptors are not defined as an instantaneous series. As many of the timbre descriptors rely on a previous estimation of the fundamental frequency and the harmonic peaks of the spectrum or on the temporal signal envelope, the extraction of these is explained first rather than repeating it for each timbre descriptor. The calculation of the fundamental frequency and the harmonic peaks is required before the calculation of each of the instantaneous harmonic spectral features, including centroid, deviation, spread and variation. Many of the timbre descriptors have been designed for specific use upon harmonic signals, such as a monophonic musical signal. Each descriptor describes a sound segment. An example of a sound segment would be a single note played on a clarinet. The recommended parameters for extraction of the timbre descriptors depend upon whether the global values alone are required or whether the instantaneous values are also required. If only the global values of the Timbre descriptors are required then the recommended extraction. The fundamental frequency is the frequency that best explains the periodicity of a signal. While numerous methods have been proposed in order to estimate it, one can simply compute the local normalized auto-correlation function of the signal and take its first maximum in order to estimate the local fundamental period. The local fundamental frequency is then estimated by the inverse of the time corresponding to the position of this maximum. The harmonic peaks are the peaks of the spectrum located around the multiple of the fundamental frequency of the signal. The term around is used in order to take into account the slight variations of harmonicity of some sounds (piano for example). 
While numerous methods have been proposed in order to estimate the harmonic peaks, one can simply look for the maxima of the amplitude of the Short Time Fourier Transform (STFT) close to the multiples of the fundamental frequency. The frequencies of the harmonic peaks are then estimated by the positions of these maxima while the amplitudes of these maxima determine their amplitudes. To determine the amplitude, A, and frequency, f, of harmonic harmo in the frame frame - Let X(k,frame), k =,N be the STFT (of size N) of the frame, frame, of data: Page 8

30 A( frame, harmo) = max ( X ( m, frame) ) = m [ a, b] f ( frame, harmo) = M DF where DF=sr/N is the frequency separation of coefficients sr is the sampling rate f0 is the estimated fundamental frequency where X ( M, frame) f0 f0 a = floor(( harmo c) ) and b = ceil( harmo + c) ) DF DF c [0,0.5], determines the tolerated non-harmonicity. A value of c=0.5 is recommended. LogAttackTimeType The LogAttackTime is the time duration between the time the signal starts to the time it reaches its stable part. The Units are : [log 0 sec] and the Range: [log 0 (/sr), is determined by the length of the signal] Where sr stands for sampling rate. First Estimate the temporal signal envelope over the time of the segment then Compute the LogAttackTime, LAT, as follows LAT = log 0 ( T T0) Where T0 is the time the signal starts; T is the time the signal reaches its sustained part (harmonic space) or maximum part (percussive space). Signal envelope(t) T0 T Illustration of log-attack time t Typically T0 can be estimated as the time the signal envelope exceeds % of its maximum value. However, with Mirai, I may create a variable to adjust this threshold depending on what the input frequency is. T can be estimated, simply, as the time the signal envelope reaches its maximum value HarmonicSpectralCentroidType The HarmonicSpectralCentroid is computed as the average over the sound segment duration of the instantaneous HarmonicSpectralCentroid within a running window. The instantatneous HarmonicSpectralCentroid is computed as the amplitude (linear scale) weighted mean of the harmonic peaks of the spectrum. Unit: [Hz] Range: [0,sr/]The HarmonicSpectralCentroid may be extracted using the following algorithm: Estimate the harmonic peaks over the sound segment then Calculate the instantaneous HarmonicSpectralCentroid, IHSC, for each frame as follows: Page 9

31 IHSC( frame) = nb _ harmo harmo= f ( frame, harmo) A( frame, harmo) nb _ harmo harmo= A( frame, harmo) where A(frame,harmo) is the amplitude of the harmonic peak number harmo at the frame number frame f(frame,harmo) is the frequency of the harmonic peak number harmo at the frame number frame nb_harmo is the number of harmonics taken into account After the above is performed we Calculate the HarmonicSpectralCentroid, HSC, for the sound segment as follows: HSC = nb _ frames frame= IHSC( frame) nb _ frames Where nb frames is the number of frames in the sound segment HarmonicSpectralDeviationType The HarmonicSpectralDeviation is computed as the average over the sound segment duration of the instantaneous HarmonicSpectralDeviation within a running window. The instantaneous HarmonicSpectralDeviation is computed as the spectral deviation of log-amplitude components from a global spectral envelope. Unit: [-] Range: [0,] The The use of a logarithmic amplitude scale instead of a linear one is derived from experimental results on human perception of timbre similarity. The use of a logarithmic scale instead of a linear one significantly increases the explanation of these experimental results. HarmonicSpectralDeviation may be extracted using the following algorithm: Estimate the harmonic peaks over the sound segment, then estimate the spectral envelope (SE) (Informative) To approximate the local Spectral Envelope take the mean amplitude of three adjacent harmonic peaks. To evaluate the ends of the envelope simply use the mean amplitude of two adjacent harmonic peaks. For harmo = SE ( frame, harmo) = A( frame, harmo) + A( frame, harmo + ) For harmo = to nb_harmo- For harmo = nb_harmo SE( frame, harmo) = i= A( frame, harmo + i), harmo =, nb _ harmo 3 Page 30

32 SE( frame, harmo) = A( frame, harmo ) + A( frame, harmo) Where nb_harmo is the number of harmonics taken into account calculate the instantaneous HarmonicSpectralDeviation, IHSD, for each frame as follows: IHSD( frame) = nb _ harmo harmo= log 0 ( A( frame, harmo)) log nb _ harmo log harmo= 0 0 ( A( frame, harmo)) ( SE( frame, harmo)) where A(frame,harmo) is the amplitude of the harmonic peak number harmo at the frame number frame SE(frame,harmo) is the local Spectral Envelope around the harmonic peak number harmo nb_harmo is the number of harmonics taken into account Calculate the HarmonicSpectralDeviation, HSD, for the sound segment as follows: HSD = nb _ frames frame= IHSD( frame) nb _ frames Where nb_frames is the number of frames in the sound segment HarmonicSpectralSpreadType The HarmonicSpectralSpread is computed as the average over the sound segment duration of the instantaneous HarmonicSpectralSpread within a running window. The instantaneous HarmonicSpectralSpread is computed as the amplitude weighted standard deviation of the harmonic peaks of the spectrum, normalized by the instantaneous HarmonicSpectralCentroid. Units: [-] Range: [0,] The HSS may be extracted using the following algorithm. Estimate the harmonic peaks over the sound segment Estimate the instantaneous HarmonicSpectralCentroid, IHSC, of each frame. Calculate the instantaneous HarmonicSpectralSpread, IHSS, for each frame as follows: IHSS ( frame) nb _ harmo A ( frame, harmo) harmo= = nb _ harmo IHSC( frame) harmo= [ f ( frame, harmo) IHSC( frame) ] A ( frame, harmo) where A(frame,harmo) is the amplitude of the harmonic peak number harmo at the frame number frame Page 3

33 f(frame,harmo) is the frequency of the harmonic peak number harmo at the frame number frame nb_harmo is the number of harmonics taken into account Calculate the HarmonicSpectralSpread, HSS, for each sound segment as follows: HSS = nb _ frames frame= IHSS( frame) nb _ frames Where nb_frames is the number of frames in the sound segment HarmonicSpectralVariationType The HarmonicSpectralVariation is defined as the mean over the sound segment duration of the instantaneous HarmonicSpectralVariation. The instantaneous HarmonicSpectralVariation is defined as the normalized correlation between the amplitude of the harmonic peaks of two adjacent frames. Units: [-] Range: [0,] The HSV may be extracted using the following algorithm. Estimate the harmonic peaks over the sound segment. Calculate the instantaneous HarmonicSpectralVariation, IHSV, for each frame as follows: IHSV( frame) = harmo= nb _ harmo harmo= nb _ harmo A( frame, harmo) A( frame, harmo) A ( frame, harmo) nb _ harmo harmo= A ( frame, harmo) Where A(frame,harmo) is the amplitude of the harmonic peak number harmo at the frame number frame nb_harmo is the number of harmonics taken into account Calculate the HarmonicSpectralVariation, HSV, for the sound segment as follows: HSV = nb _ frames frame= IHSV( frame) nb _ frames Where nb_frames is the number of frames in the sound segment SpectralCentroidType The SpectralCentroid is computed as the power weighted average of the frequency of the bins in the power spectrum. Unit: [Hz] Range: [0,sr/] where sr stands for sampling rate. The SC may be extracted using the following algorithm. Determine the power spectrum over the sound segment. (Informative) While numerous methods have been proposed in order to compute the power spectrum, one can simply use the Welch method (averaged periodogram) both for harmonic and percussive sounds. Calculate the SpectralCentroid, SC, for the segment as follows: Page 3

34 SC ( frame) powerspectrum _ size f ( k) S( k) k= = powerspectrum _ size k= S( k) where S(k) is the kth power spectrum coefficient f(k) stands for the frequency of the kth power spectrum coefficient TemporaryCentroidType The TemporalCentroid is defined as the time averaged over the energy envelope. Unit: [sec] Range: [0,determined by the length of the signal] The TemporalCentroid may be extracted using the following algorithm: Calculate the Signal Envelope, SEnv, Calculate the TemporalCentroid, TC as follows: TC length( SEnv) n= = length( SEnv) n / sr SEnv( n) n= SEnv( n) where SEnv is the Signal Envelope. sr is the Sampling Rate. Future Experimentation Includes dynamic interaction with hormonal levels to steer emotional state vector from one state to another vector/state. Also music retrieval, I want a happy song. Primarily, both of the aforementioned future work is dependent on scalar recognition and assimilating an emotional ontology to certain scales. Herein I propose to first work with the Blues scale. The Blues is a music genre most likely to produce a specific emotion to humans assimilated with western culture. Conversely, the musical genres of Jazz, Classical, Country and Rock include musical structures and instances that can make some humans happy but others excited. Certain instances make some humans agitated while making others feel motivated. Certain instances make some humans mellow while making others excited and so on. However, the Blues is different. The Blues emits sound waves that, for the most part, will indeed make one feel blue or sad. Webster's 93 Dictionary defines the blues as ) a type of folk song that originated among Black Americans at the beginning of the 0th century; has a melancholy sound from repeated use of blue notes; and ) a state of depression; as, he had a bad case of the blues. Nevertheless, what is the mathematical explanation for this anomaly? In other words, what is happening in the mathematical interpretations of waveforms emitted from a blues scale that invokes humans to emit emotions of sadness or melancholy? Firstly, Page 33

35 I propose that if one can mathematically pin down exactly why the blues scale makes people melancholy then, clearly, one can search for sad music in a data base and secondly, one can implement the rules founded in the blues mathematical synthesis to other less definitive emotions in music such as happy, excited and so on. Blues music, as a whole, contains three attributes: a rhythm is closely associated to African rhythms. a pentatonic sounding music accentuating its flattened III and VII. a call & response structure similar to European and English folk music using the same three chords over a diatonic scale. Blues rhythms can easily be associated to African rhythms because it originated with the African slaves in North Mississippi Delta prior to the Civil War of the United States. There is a also myth, which I believe is probably not too far from the truth: Blues folklore states that the slaves would hear the piano playing of their white owners and try to replicate it in their guitar playing, field hollers, ballads and spiritual/church music. However, they never quite got it right, or, they purposefully, simply, preferred to flatten the III and VII because in their state of sadness and depression it simply felt better. Blues lyrics, are not part of the mathematical structure of notes that emit emotions, but suffice to say, it typically encompass misfortunes and trouble. However, taking these three points into consideration, we know that there is ) relentless rhythm in the music that ) repeats, with the use of 3) flattened III s and VII s, which taken as a whole fits perfectly with a sorrowful story and the forlornness of a lost soul many times over. With the aforementioned in mind, we now look a little deeper into the musical structure. Humans characterize sound waves by three parameters: its pitch, loudness and quality. Loudness is measured in a logarithmic scale ("decibels"), defined as ten times the exponent of 0 for the loudness value. Pitch is the frequency of musical "notes" of arranged on a musical "scale". Western culture s "equitonic" scale consists of "octaves", each containing 8 whole notes "A", "B", "C", "D", "E", "F", "G", and "A". The ratio from one of the keys to the next is the same for each key and is centered on the "Middle C" note at 64 hertz. The equitonic, as displayed on a piano keyboard shows keys in each octave. Here, the seven white keys are the "whole" notes and five black keys, or "sharp and flat notes" and "flat notes". The Blues scale is a subset of the equitonic scale. An octave is a doubling of frequencies, therefore its interval is the th root of ( ). Scales and chords utilize harmonics and overtones. In order to understand the blues scale one needs to understand scales and harmonics. A harmonic exists when one multiplies a notes frequency by a whole number. Harmonics and overtones are the same thing labeled differently. For example, the first overtone of a frequency equals the second harmonic. I will refer to harmonics for the sake of consistency. Note that the second harmonic is a note with twice the frequency, or, commonly known as an octave. scale. The most common scales is the major scale and its seven modes which are the same scale but starting with a root note on another note. It is said that each of the seven modes has a distinct emotion to it. The major or Ionian is for happy music. The minor, or Aeolian and Dorian scales that have roots on the nd & 6 th notes of the major are for sad or dark music. 
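As a small numerical companion to this discussion, the sketch below computes equal-tempered frequencies for the minor-pentatonic degrees that underlie the blues scale (the 220 Hz A root is an arbitrary example; a full blues scale would also add a flattened V, not shown):

```python
SEMITONE = 2 ** (1 / 12)                     # ratio between adjacent keys in 12-tone equal temperament

def equal_tempered(root_hz, semitones):
    """Frequency of the note a given number of equal-tempered semitones
    above a root. Illustrative sketch of the arithmetic described above."""
    return root_hz * SEMITONE ** semitones

# Root, flattened III, IV, V, flattened VII, and the octave of the root.
degrees = {"root": 0, "bIII": 3, "IV": 5, "V": 7, "bVII": 10, "octave": 12}
for name, s in degrees.items():
    print(f"{name:7s} {equal_tempered(220.0, s):7.1f} Hz")
```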
The Blues scale, which is based on the pentatonic, is found using the same method as the Aeolian and Dorian scales, except that it only incorporates four notes after the root. However, in modern music we say the pentatonic has 6 notes, because we include the playing of the root one octave higher as part of the scale. The African slaves began copying and then either mistakenly or purposefully flattening its III and VII notes over this pentatonic scale. Something, not yet fully understood by science, inherent in the aforementioned evokes a

human to feel sad. Why? I assert that the answer lies within an understanding of the physics, Fourier transforms and harmonics of waveforms. To understand harmonics, consider a simple sinusoidal wave of the form $y = A\sin(\omega t)$, where A is the amplitude. The wave is also periodic, meaning that a wave with frequency f repeats itself with period T, where $\omega = 2\pi f = 2\pi/T$. When sound emanates from an instrument there is a fundamental frequency accompanied by integer multiples of the fundamental frequency called overtones. As mentioned above, overtones that are integral multiples of the fundamental are called harmonics. We can express a waveform F(t) simply as a series sum of harmonics, $F(t) = \sum_n A_n \sin(2\pi n f t + \phi_n)$. Some of these harmonics, when paired with others, give Westerners a sense of good-sounding notes. These musical intervals include the unison (1:1), octave (2:1), fifth (3:2), fourth (4:3), major third (5:4), minor third (6:5), major sixth (5:3) and minor sixth (8:5). Accordingly, there are groups of waveforms consisting of three or more notes that also produce agreeable responses in Westerners, and we call these chords. Major chords have three notes in a ratio of 4:5:6, and the frequencies of the major diatonic scale can likewise be written as ratios of the frequency f of the root or tonic. Both musicians and non-musicians acknowledge that a major chord has a natural and pleasant sound. Mathematically we can say: yes, of course, because the notes' harmonic patterns integrate with one another. For example, in A major, the seventh harmonic of the root falls between F# and G, and the eighth is A again, three octaves up. But one may still ask: what about sad notes? The sad notes of the blues can be comprehended by realizing a slight shift. A chart of the harmonics of a C major chord (C-E-G) shows how the harmonics of the three notes interlock. Doing the same for the C minor chord (C-D#-G), it is clear we have induced a darker feel. Mathematically we can say: of course, there is a missing harmonic, and it is this that creates a darker "feel" to the chord.

The human ear takes the complicated sound wave and resolves the relative phases of its overtones into a perception of the timbre of the note. Fourier proved that any vibration can be represented mathematically. Consider a "sawtooth" wave: Fourier analysis takes the fundamental and the first harmonic and adds them together; in the next step, it takes that sum and adds it to the next overtone. Continuing in this way until one reaches the 9th overtone, the sum comes ever closer to the desired sawtooth shape; a small sketch of these partial sums follows. I suggest, and hope to prove, that upon defining degrees to and from the major, played over varying scales, lies the answer to emotion in music. Humankind knows the answer is there; we know the blues makes us all sad and certain other music makes us happy. Some questions remain to be answered, such as whether the pentatonic's flattened III and VII will be made more visible by first transforming the major, over its scale, to a sawtooth and then, and only then, looking for variances in frequency, amplitude and so on.
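To make the partial-sum picture concrete, the following sketch builds a sawtooth from its first few harmonics (illustrative only; not part of the proposed MIRAI system):

```python
import numpy as np

def sawtooth_partial_sum(f0, n_harmonics, sr=44100, dur=0.01):
    """Approximate a sawtooth wave by summing its Fourier harmonics:
    saw(t) ~ (2/pi) * sum_{n=1..N} (-1)**(n+1) * sin(2*pi*n*f0*t) / n.
    Illustrative sketch; parameter names are my own."""
    t = np.arange(int(sr * dur)) / sr
    y = np.zeros_like(t)
    for n in range(1, n_harmonics + 1):
        y += (-1) ** (n + 1) * np.sin(2 * np.pi * n * f0 * t) / n
    return t, (2 / np.pi) * y

# Each added harmonic brings the waveform closer to the ideal sawtooth.
for n in (1, 2, 10):
    _, y = sawtooth_partial_sum(440, n)
    print(n, "harmonics -> peak value", round(float(y.max()), 3))
```

As more harmonics are added, the waveform converges toward the ideal sawtooth, which is exactly the transformation the question above proposes to exploit.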


More information

Automatic Identification of Instrument Type in Music Signal using Wavelet and MFCC

Automatic Identification of Instrument Type in Music Signal using Wavelet and MFCC Automatic Identification of Instrument Type in Music Signal using Wavelet and MFCC Arijit Ghosal, Rudrasis Chakraborty, Bibhas Chandra Dhara +, and Sanjoy Kumar Saha! * CSE Dept., Institute of Technology

More information

19 th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, 2-7 SEPTEMBER 2007

19 th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, 2-7 SEPTEMBER 2007 19 th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, 2-7 SEPTEMBER 2007 AN HMM BASED INVESTIGATION OF DIFFERENCES BETWEEN MUSICAL INSTRUMENTS OF THE SAME TYPE PACS: 43.75.-z Eichner, Matthias; Wolff, Matthias;

More information

Music Emotion Recognition. Jaesung Lee. Chung-Ang University

Music Emotion Recognition. Jaesung Lee. Chung-Ang University Music Emotion Recognition Jaesung Lee Chung-Ang University Introduction Searching Music in Music Information Retrieval Some information about target music is available Query by Text: Title, Artist, or

More information

International Journal of Advance Engineering and Research Development MUSICAL INSTRUMENT IDENTIFICATION AND STATUS FINDING WITH MFCC

International Journal of Advance Engineering and Research Development MUSICAL INSTRUMENT IDENTIFICATION AND STATUS FINDING WITH MFCC Scientific Journal of Impact Factor (SJIF): 5.71 International Journal of Advance Engineering and Research Development Volume 5, Issue 04, April -2018 e-issn (O): 2348-4470 p-issn (P): 2348-6406 MUSICAL

More information

Musical Instrument Identification Using Principal Component Analysis and Multi-Layered Perceptrons

Musical Instrument Identification Using Principal Component Analysis and Multi-Layered Perceptrons Musical Instrument Identification Using Principal Component Analysis and Multi-Layered Perceptrons Róisín Loughran roisin.loughran@ul.ie Jacqueline Walker jacqueline.walker@ul.ie Michael O Neill University

More information

THE importance of music content analysis for musical

THE importance of music content analysis for musical IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 1, JANUARY 2007 333 Drum Sound Recognition for Polyphonic Audio Signals by Adaptation and Matching of Spectrogram Templates With

More information

Robert Alexandru Dobre, Cristian Negrescu

Robert Alexandru Dobre, Cristian Negrescu ECAI 2016 - International Conference 8th Edition Electronics, Computers and Artificial Intelligence 30 June -02 July, 2016, Ploiesti, ROMÂNIA Automatic Music Transcription Software Based on Constant Q

More information

Automatic Rhythmic Notation from Single Voice Audio Sources

Automatic Rhythmic Notation from Single Voice Audio Sources Automatic Rhythmic Notation from Single Voice Audio Sources Jack O Reilly, Shashwat Udit Introduction In this project we used machine learning technique to make estimations of rhythmic notation of a sung

More information

A QUERY BY EXAMPLE MUSIC RETRIEVAL ALGORITHM

A QUERY BY EXAMPLE MUSIC RETRIEVAL ALGORITHM A QUER B EAMPLE MUSIC RETRIEVAL ALGORITHM H. HARB AND L. CHEN Maths-Info department, Ecole Centrale de Lyon. 36, av. Guy de Collongue, 69134, Ecully, France, EUROPE E-mail: {hadi.harb, liming.chen}@ec-lyon.fr

More information

Application Of Missing Feature Theory To The Recognition Of Musical Instruments In Polyphonic Audio

Application Of Missing Feature Theory To The Recognition Of Musical Instruments In Polyphonic Audio Application Of Missing Feature Theory To The Recognition Of Musical Instruments In Polyphonic Audio Jana Eggink and Guy J. Brown Department of Computer Science, University of Sheffield Regent Court, 11

More information

POST-PROCESSING FIDDLE : A REAL-TIME MULTI-PITCH TRACKING TECHNIQUE USING HARMONIC PARTIAL SUBTRACTION FOR USE WITHIN LIVE PERFORMANCE SYSTEMS

POST-PROCESSING FIDDLE : A REAL-TIME MULTI-PITCH TRACKING TECHNIQUE USING HARMONIC PARTIAL SUBTRACTION FOR USE WITHIN LIVE PERFORMANCE SYSTEMS POST-PROCESSING FIDDLE : A REAL-TIME MULTI-PITCH TRACKING TECHNIQUE USING HARMONIC PARTIAL SUBTRACTION FOR USE WITHIN LIVE PERFORMANCE SYSTEMS Andrew N. Robertson, Mark D. Plumbley Centre for Digital Music

More information

TOWARD AN INTELLIGENT EDITOR FOR JAZZ MUSIC

TOWARD AN INTELLIGENT EDITOR FOR JAZZ MUSIC TOWARD AN INTELLIGENT EDITOR FOR JAZZ MUSIC G.TZANETAKIS, N.HU, AND R.B. DANNENBERG Computer Science Department, Carnegie Mellon University 5000 Forbes Avenue, Pittsburgh, PA 15213, USA E-mail: gtzan@cs.cmu.edu

More information

Musical instrument identification in continuous recordings

Musical instrument identification in continuous recordings Musical instrument identification in continuous recordings Arie Livshin, Xavier Rodet To cite this version: Arie Livshin, Xavier Rodet. Musical instrument identification in continuous recordings. Digital

More information

WHAT MAKES FOR A HIT POP SONG? WHAT MAKES FOR A POP SONG?

WHAT MAKES FOR A HIT POP SONG? WHAT MAKES FOR A POP SONG? WHAT MAKES FOR A HIT POP SONG? WHAT MAKES FOR A POP SONG? NICHOLAS BORG AND GEORGE HOKKANEN Abstract. The possibility of a hit song prediction algorithm is both academically interesting and industry motivated.

More information

Experiments on musical instrument separation using multiplecause

Experiments on musical instrument separation using multiplecause Experiments on musical instrument separation using multiplecause models J Klingseisen and M D Plumbley* Department of Electronic Engineering King's College London * - Corresponding Author - mark.plumbley@kcl.ac.uk

More information

WE ADDRESS the development of a novel computational

WE ADDRESS the development of a novel computational IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 18, NO. 3, MARCH 2010 663 Dynamic Spectral Envelope Modeling for Timbre Analysis of Musical Instrument Sounds Juan José Burred, Member,

More information

Computational Models of Music Similarity. Elias Pampalk National Institute for Advanced Industrial Science and Technology (AIST)

Computational Models of Music Similarity. Elias Pampalk National Institute for Advanced Industrial Science and Technology (AIST) Computational Models of Music Similarity 1 Elias Pampalk National Institute for Advanced Industrial Science and Technology (AIST) Abstract The perceived similarity of two pieces of music is multi-dimensional,

More information

TYING SEMANTIC LABELS TO COMPUTATIONAL DESCRIPTORS OF SIMILAR TIMBRES

TYING SEMANTIC LABELS TO COMPUTATIONAL DESCRIPTORS OF SIMILAR TIMBRES TYING SEMANTIC LABELS TO COMPUTATIONAL DESCRIPTORS OF SIMILAR TIMBRES Rosemary A. Fitzgerald Department of Music Lancaster University, Lancaster, LA1 4YW, UK r.a.fitzgerald@lancaster.ac.uk ABSTRACT This

More information

Recognising Cello Performers using Timbre Models

Recognising Cello Performers using Timbre Models Recognising Cello Performers using Timbre Models Chudy, Magdalena; Dixon, Simon For additional information about this publication click this link. http://qmro.qmul.ac.uk/jspui/handle/123456789/5013 Information

More information

A prototype system for rule-based expressive modifications of audio recordings

A prototype system for rule-based expressive modifications of audio recordings International Symposium on Performance Science ISBN 0-00-000000-0 / 000-0-00-000000-0 The Author 2007, Published by the AEC All rights reserved A prototype system for rule-based expressive modifications

More information

Multiple classifiers for different features in timbre estimation

Multiple classifiers for different features in timbre estimation Multiple classifiers for different features in timbre estimation Wenxin Jiang 1, Xin Zhang 3, Amanda Cohen 1, Zbigniew W. Ras 1,2 1 Computer Science Department, University of North Carolina, Charlotte,

More information

LOUDNESS EFFECT OF THE DIFFERENT TONES ON THE TIMBRE SUBJECTIVE PERCEPTION EXPERIMENT OF ERHU

LOUDNESS EFFECT OF THE DIFFERENT TONES ON THE TIMBRE SUBJECTIVE PERCEPTION EXPERIMENT OF ERHU The 21 st International Congress on Sound and Vibration 13-17 July, 2014, Beijing/China LOUDNESS EFFECT OF THE DIFFERENT TONES ON THE TIMBRE SUBJECTIVE PERCEPTION EXPERIMENT OF ERHU Siyu Zhu, Peifeng Ji,

More information

DAY 1. Intelligent Audio Systems: A review of the foundations and applications of semantic audio analysis and music information retrieval

DAY 1. Intelligent Audio Systems: A review of the foundations and applications of semantic audio analysis and music information retrieval DAY 1 Intelligent Audio Systems: A review of the foundations and applications of semantic audio analysis and music information retrieval Jay LeBoeuf Imagine Research jay{at}imagine-research.com Rebecca

More information

Singer Recognition and Modeling Singer Error

Singer Recognition and Modeling Singer Error Singer Recognition and Modeling Singer Error Johan Ismael Stanford University jismael@stanford.edu Nicholas McGee Stanford University ndmcgee@stanford.edu 1. Abstract We propose a system for recognizing

More information

POLYPHONIC INSTRUMENT RECOGNITION USING SPECTRAL CLUSTERING

POLYPHONIC INSTRUMENT RECOGNITION USING SPECTRAL CLUSTERING POLYPHONIC INSTRUMENT RECOGNITION USING SPECTRAL CLUSTERING Luis Gustavo Martins Telecommunications and Multimedia Unit INESC Porto Porto, Portugal lmartins@inescporto.pt Juan José Burred Communication

More information

MUSICAL NOTE AND INSTRUMENT CLASSIFICATION WITH LIKELIHOOD-FREQUENCY-TIME ANALYSIS AND SUPPORT VECTOR MACHINES

MUSICAL NOTE AND INSTRUMENT CLASSIFICATION WITH LIKELIHOOD-FREQUENCY-TIME ANALYSIS AND SUPPORT VECTOR MACHINES MUSICAL NOTE AND INSTRUMENT CLASSIFICATION WITH LIKELIHOOD-FREQUENCY-TIME ANALYSIS AND SUPPORT VECTOR MACHINES Mehmet Erdal Özbek 1, Claude Delpha 2, and Pierre Duhamel 2 1 Dept. of Electrical and Electronics

More information

MOTIVATION AGENDA MUSIC, EMOTION, AND TIMBRE CHARACTERIZING THE EMOTION OF INDIVIDUAL PIANO AND OTHER MUSICAL INSTRUMENT SOUNDS

MOTIVATION AGENDA MUSIC, EMOTION, AND TIMBRE CHARACTERIZING THE EMOTION OF INDIVIDUAL PIANO AND OTHER MUSICAL INSTRUMENT SOUNDS MOTIVATION Thank you YouTube! Why do composers spend tremendous effort for the right combination of musical instruments? CHARACTERIZING THE EMOTION OF INDIVIDUAL PIANO AND OTHER MUSICAL INSTRUMENT SOUNDS

More information

A FUNCTIONAL CLASSIFICATION OF ONE INSTRUMENT S TIMBRES

A FUNCTIONAL CLASSIFICATION OF ONE INSTRUMENT S TIMBRES A FUNCTIONAL CLASSIFICATION OF ONE INSTRUMENT S TIMBRES Panayiotis Kokoras School of Music Studies Aristotle University of Thessaloniki email@panayiotiskokoras.com Abstract. This article proposes a theoretical

More information

Tempo and Beat Analysis

Tempo and Beat Analysis Advanced Course Computer Science Music Processing Summer Term 2010 Meinard Müller, Peter Grosche Saarland University and MPI Informatik meinard@mpi-inf.mpg.de Tempo and Beat Analysis Musical Properties:

More information

Proceedings of Meetings on Acoustics

Proceedings of Meetings on Acoustics Proceedings of Meetings on Acoustics Volume 19, 2013 http://acousticalsociety.org/ ICA 2013 Montreal Montreal, Canada 2-7 June 2013 Musical Acoustics Session 3pMU: Perception and Orchestration Practice

More information

Drum Sound Identification for Polyphonic Music Using Template Adaptation and Matching Methods

Drum Sound Identification for Polyphonic Music Using Template Adaptation and Matching Methods Drum Sound Identification for Polyphonic Music Using Template Adaptation and Matching Methods Kazuyoshi Yoshii, Masataka Goto and Hiroshi G. Okuno Department of Intelligence Science and Technology National

More information

Automatic music transcription

Automatic music transcription Music transcription 1 Music transcription 2 Automatic music transcription Sources: * Klapuri, Introduction to music transcription, 2006. www.cs.tut.fi/sgn/arg/klap/amt-intro.pdf * Klapuri, Eronen, Astola:

More information

Reducing False Positives in Video Shot Detection

Reducing False Positives in Video Shot Detection Reducing False Positives in Video Shot Detection Nithya Manickam Computer Science & Engineering Department Indian Institute of Technology, Bombay Powai, India - 400076 mnitya@cse.iitb.ac.in Sharat Chandran

More information

An Accurate Timbre Model for Musical Instruments and its Application to Classification

An Accurate Timbre Model for Musical Instruments and its Application to Classification An Accurate Timbre Model for Musical Instruments and its Application to Classification Juan José Burred 1,AxelRöbel 2, and Xavier Rodet 2 1 Communication Systems Group, Technical University of Berlin,

More information

Query By Humming: Finding Songs in a Polyphonic Database

Query By Humming: Finding Songs in a Polyphonic Database Query By Humming: Finding Songs in a Polyphonic Database John Duchi Computer Science Department Stanford University jduchi@stanford.edu Benjamin Phipps Computer Science Department Stanford University bphipps@stanford.edu

More information

SYNTHESIS FROM MUSICAL INSTRUMENT CHARACTER MAPS

SYNTHESIS FROM MUSICAL INSTRUMENT CHARACTER MAPS Published by Institute of Electrical Engineers (IEE). 1998 IEE, Paul Masri, Nishan Canagarajah Colloquium on "Audio and Music Technology"; November 1998, London. Digest No. 98/470 SYNTHESIS FROM MUSICAL

More information

Supervised Learning in Genre Classification

Supervised Learning in Genre Classification Supervised Learning in Genre Classification Introduction & Motivation Mohit Rajani and Luke Ekkizogloy {i.mohit,luke.ekkizogloy}@gmail.com Stanford University, CS229: Machine Learning, 2009 Now that music

More information

Music Representations

Music Representations Lecture Music Processing Music Representations Meinard Müller International Audio Laboratories Erlangen meinard.mueller@audiolabs-erlangen.de Book: Fundamentals of Music Processing Meinard Müller Fundamentals

More information

Hidden Markov Model based dance recognition

Hidden Markov Model based dance recognition Hidden Markov Model based dance recognition Dragutin Hrenek, Nenad Mikša, Robert Perica, Pavle Prentašić and Boris Trubić University of Zagreb, Faculty of Electrical Engineering and Computing Unska 3,

More information

Multi-label classification of emotions in music

Multi-label classification of emotions in music Multi-label classification of emotions in music Alicja Wieczorkowska 1, Piotr Synak 1, and Zbigniew W. Raś 2,1 1 Polish-Japanese Institute of Information Technology, Koszykowa 86, 02-008 Warsaw, Poland

More information

Lecture 9 Source Separation

Lecture 9 Source Separation 10420CS 573100 音樂資訊檢索 Music Information Retrieval Lecture 9 Source Separation Yi-Hsuan Yang Ph.D. http://www.citi.sinica.edu.tw/pages/yang/ yang@citi.sinica.edu.tw Music & Audio Computing Lab, Research

More information

DAT335 Music Perception and Cognition Cogswell Polytechnical College Spring Week 6 Class Notes

DAT335 Music Perception and Cognition Cogswell Polytechnical College Spring Week 6 Class Notes DAT335 Music Perception and Cognition Cogswell Polytechnical College Spring 2009 Week 6 Class Notes Pitch Perception Introduction Pitch may be described as that attribute of auditory sensation in terms

More information

Transcription of the Singing Melody in Polyphonic Music

Transcription of the Singing Melody in Polyphonic Music Transcription of the Singing Melody in Polyphonic Music Matti Ryynänen and Anssi Klapuri Institute of Signal Processing, Tampere University Of Technology P.O.Box 553, FI-33101 Tampere, Finland {matti.ryynanen,

More information

AUTOREGRESSIVE MFCC MODELS FOR GENRE CLASSIFICATION IMPROVED BY HARMONIC-PERCUSSION SEPARATION

AUTOREGRESSIVE MFCC MODELS FOR GENRE CLASSIFICATION IMPROVED BY HARMONIC-PERCUSSION SEPARATION AUTOREGRESSIVE MFCC MODELS FOR GENRE CLASSIFICATION IMPROVED BY HARMONIC-PERCUSSION SEPARATION Halfdan Rump, Shigeki Miyabe, Emiru Tsunoo, Nobukata Ono, Shigeki Sagama The University of Tokyo, Graduate

More information

Melody Retrieval On The Web

Melody Retrieval On The Web Melody Retrieval On The Web Thesis proposal for the degree of Master of Science at the Massachusetts Institute of Technology M.I.T Media Laboratory Fall 2000 Thesis supervisor: Barry Vercoe Professor,

More information

Introductions to Music Information Retrieval

Introductions to Music Information Retrieval Introductions to Music Information Retrieval ECE 272/472 Audio Signal Processing Bochen Li University of Rochester Wish List For music learners/performers While I play the piano, turn the page for me Tell

More information

A PERPLEXITY BASED COVER SONG MATCHING SYSTEM FOR SHORT LENGTH QUERIES

A PERPLEXITY BASED COVER SONG MATCHING SYSTEM FOR SHORT LENGTH QUERIES 12th International Society for Music Information Retrieval Conference (ISMIR 2011) A PERPLEXITY BASED COVER SONG MATCHING SYSTEM FOR SHORT LENGTH QUERIES Erdem Unal 1 Elaine Chew 2 Panayiotis Georgiou

More information

Analysis, Synthesis, and Perception of Musical Sounds

Analysis, Synthesis, and Perception of Musical Sounds Analysis, Synthesis, and Perception of Musical Sounds The Sound of Music James W. Beauchamp Editor University of Illinois at Urbana, USA 4y Springer Contents Preface Acknowledgments vii xv 1. Analysis

More information

However, in studies of expressive timing, the aim is to investigate production rather than perception of timing, that is, independently of the listene

However, in studies of expressive timing, the aim is to investigate production rather than perception of timing, that is, independently of the listene Beat Extraction from Expressive Musical Performances Simon Dixon, Werner Goebl and Emilios Cambouropoulos Austrian Research Institute for Artificial Intelligence, Schottengasse 3, A-1010 Vienna, Austria.

More information

Week 14 Query-by-Humming and Music Fingerprinting. Roger B. Dannenberg Professor of Computer Science, Art and Music Carnegie Mellon University

Week 14 Query-by-Humming and Music Fingerprinting. Roger B. Dannenberg Professor of Computer Science, Art and Music Carnegie Mellon University Week 14 Query-by-Humming and Music Fingerprinting Roger B. Dannenberg Professor of Computer Science, Art and Music Overview n Melody-Based Retrieval n Audio-Score Alignment n Music Fingerprinting 2 Metadata-based

More information

CSC475 Music Information Retrieval

CSC475 Music Information Retrieval CSC475 Music Information Retrieval Monophonic pitch extraction George Tzanetakis University of Victoria 2014 G. Tzanetakis 1 / 32 Table of Contents I 1 Motivation and Terminology 2 Psychacoustics 3 F0

More information

Singer Traits Identification using Deep Neural Network

Singer Traits Identification using Deep Neural Network Singer Traits Identification using Deep Neural Network Zhengshan Shi Center for Computer Research in Music and Acoustics Stanford University kittyshi@stanford.edu Abstract The author investigates automatic

More information

Automatic Laughter Detection

Automatic Laughter Detection Automatic Laughter Detection Mary Knox 1803707 knoxm@eecs.berkeley.edu December 1, 006 Abstract We built a system to automatically detect laughter from acoustic features of audio. To implement the system,

More information

Automatic Laughter Detection

Automatic Laughter Detection Automatic Laughter Detection Mary Knox Final Project (EECS 94) knoxm@eecs.berkeley.edu December 1, 006 1 Introduction Laughter is a powerful cue in communication. It communicates to listeners the emotional

More information

Perceptual dimensions of short audio clips and corresponding timbre features

Perceptual dimensions of short audio clips and corresponding timbre features Perceptual dimensions of short audio clips and corresponding timbre features Jason Musil, Budr El-Nusairi, Daniel Müllensiefen Department of Psychology, Goldsmiths, University of London Question How do

More information

Recognising Cello Performers Using Timbre Models

Recognising Cello Performers Using Timbre Models Recognising Cello Performers Using Timbre Models Magdalena Chudy and Simon Dixon Abstract In this paper, we compare timbre features of various cello performers playing the same instrument in solo cello

More information

Acoustic Scene Classification

Acoustic Scene Classification Acoustic Scene Classification Marc-Christoph Gerasch Seminar Topics in Computer Music - Acoustic Scene Classification 6/24/2015 1 Outline Acoustic Scene Classification - definition History and state of

More information

Music Complexity Descriptors. Matt Stabile June 6 th, 2008

Music Complexity Descriptors. Matt Stabile June 6 th, 2008 Music Complexity Descriptors Matt Stabile June 6 th, 2008 Musical Complexity as a Semantic Descriptor Modern digital audio collections need new criteria for categorization and searching. Applicable to:

More information

Modeling memory for melodies

Modeling memory for melodies Modeling memory for melodies Daniel Müllensiefen 1 and Christian Hennig 2 1 Musikwissenschaftliches Institut, Universität Hamburg, 20354 Hamburg, Germany 2 Department of Statistical Science, University

More information

LEARNING SPECTRAL FILTERS FOR SINGLE- AND MULTI-LABEL CLASSIFICATION OF MUSICAL INSTRUMENTS. Patrick Joseph Donnelly

LEARNING SPECTRAL FILTERS FOR SINGLE- AND MULTI-LABEL CLASSIFICATION OF MUSICAL INSTRUMENTS. Patrick Joseph Donnelly LEARNING SPECTRAL FILTERS FOR SINGLE- AND MULTI-LABEL CLASSIFICATION OF MUSICAL INSTRUMENTS by Patrick Joseph Donnelly A dissertation submitted in partial fulfillment of the requirements for the degree

More information

Computer Coordination With Popular Music: A New Research Agenda 1

Computer Coordination With Popular Music: A New Research Agenda 1 Computer Coordination With Popular Music: A New Research Agenda 1 Roger B. Dannenberg roger.dannenberg@cs.cmu.edu http://www.cs.cmu.edu/~rbd School of Computer Science Carnegie Mellon University Pittsburgh,

More information

Pitch Perception and Grouping. HST.723 Neural Coding and Perception of Sound

Pitch Perception and Grouping. HST.723 Neural Coding and Perception of Sound Pitch Perception and Grouping HST.723 Neural Coding and Perception of Sound Pitch Perception. I. Pure Tones The pitch of a pure tone is strongly related to the tone s frequency, although there are small

More information

Laboratory Assignment 3. Digital Music Synthesis: Beethoven s Fifth Symphony Using MATLAB

Laboratory Assignment 3. Digital Music Synthesis: Beethoven s Fifth Symphony Using MATLAB Laboratory Assignment 3 Digital Music Synthesis: Beethoven s Fifth Symphony Using MATLAB PURPOSE In this laboratory assignment, you will use MATLAB to synthesize the audio tones that make up a well-known

More information

Automatic Labelling of tabla signals

Automatic Labelling of tabla signals ISMIR 2003 Oct. 27th 30th 2003 Baltimore (USA) Automatic Labelling of tabla signals Olivier K. GILLET, Gaël RICHARD Introduction Exponential growth of available digital information need for Indexing and

More information

Musical Acoustics Lecture 15 Pitch & Frequency (Psycho-Acoustics)

Musical Acoustics Lecture 15 Pitch & Frequency (Psycho-Acoustics) 1 Musical Acoustics Lecture 15 Pitch & Frequency (Psycho-Acoustics) Pitch Pitch is a subjective characteristic of sound Some listeners even assign pitch differently depending upon whether the sound was

More information

Recognition of Instrument Timbres in Real Polytimbral Audio Recordings

Recognition of Instrument Timbres in Real Polytimbral Audio Recordings Recognition of Instrument Timbres in Real Polytimbral Audio Recordings Elżbieta Kubera 1,2, Alicja Wieczorkowska 2, Zbigniew Raś 3,2, and Magdalena Skrzypiec 4 1 University of Life Sciences in Lublin,

More information

Music Genre Classification

Music Genre Classification Music Genre Classification chunya25 Fall 2017 1 Introduction A genre is defined as a category of artistic composition, characterized by similarities in form, style, or subject matter. [1] Some researchers

More information

APPLICATIONS OF A SEMI-AUTOMATIC MELODY EXTRACTION INTERFACE FOR INDIAN MUSIC

APPLICATIONS OF A SEMI-AUTOMATIC MELODY EXTRACTION INTERFACE FOR INDIAN MUSIC APPLICATIONS OF A SEMI-AUTOMATIC MELODY EXTRACTION INTERFACE FOR INDIAN MUSIC Vishweshwara Rao, Sachin Pant, Madhumita Bhaskar and Preeti Rao Department of Electrical Engineering, IIT Bombay {vishu, sachinp,

More information

Time Variability-Based Hierarchic Recognition of Multiple Musical Instruments in Recordings

Time Variability-Based Hierarchic Recognition of Multiple Musical Instruments in Recordings Chapter 15 Time Variability-Based Hierarchic Recognition of Multiple Musical Instruments in Recordings Elżbieta Kubera, Alicja A. Wieczorkowska, and Zbigniew W. Raś Abstract The research reported in this

More information

Music Database Retrieval Based on Spectral Similarity

Music Database Retrieval Based on Spectral Similarity Music Database Retrieval Based on Spectral Similarity Cheng Yang Department of Computer Science Stanford University yangc@cs.stanford.edu Abstract We present an efficient algorithm to retrieve similar

More information

Supervised Musical Source Separation from Mono and Stereo Mixtures based on Sinusoidal Modeling

Supervised Musical Source Separation from Mono and Stereo Mixtures based on Sinusoidal Modeling Supervised Musical Source Separation from Mono and Stereo Mixtures based on Sinusoidal Modeling Juan José Burred Équipe Analyse/Synthèse, IRCAM burred@ircam.fr Communication Systems Group Technische Universität

More information

A DISCRETE FILTER BANK APPROACH TO AUDIO TO SCORE MATCHING FOR POLYPHONIC MUSIC

A DISCRETE FILTER BANK APPROACH TO AUDIO TO SCORE MATCHING FOR POLYPHONIC MUSIC th International Society for Music Information Retrieval Conference (ISMIR 9) A DISCRETE FILTER BANK APPROACH TO AUDIO TO SCORE MATCHING FOR POLYPHONIC MUSIC Nicola Montecchio, Nicola Orio Department of

More information

EE391 Special Report (Spring 2005) Automatic Chord Recognition Using A Summary Autocorrelation Function

EE391 Special Report (Spring 2005) Automatic Chord Recognition Using A Summary Autocorrelation Function EE391 Special Report (Spring 25) Automatic Chord Recognition Using A Summary Autocorrelation Function Advisor: Professor Julius Smith Kyogu Lee Center for Computer Research in Music and Acoustics (CCRMA)

More information

MPEG-7 AUDIO SPECTRUM BASIS AS A SIGNATURE OF VIOLIN SOUND

MPEG-7 AUDIO SPECTRUM BASIS AS A SIGNATURE OF VIOLIN SOUND MPEG-7 AUDIO SPECTRUM BASIS AS A SIGNATURE OF VIOLIN SOUND Aleksander Kaminiarz, Ewa Łukasik Institute of Computing Science, Poznań University of Technology. Piotrowo 2, 60-965 Poznań, Poland e-mail: Ewa.Lukasik@cs.put.poznan.pl

More information

Using the MPEG-7 Standard for the Description of Musical Content

Using the MPEG-7 Standard for the Description of Musical Content Using the MPEG-7 Standard for the Description of Musical Content EMILIA GÓMEZ, FABIEN GOUYON, PERFECTO HERRERA, XAVIER AMATRIAIN Music Technology Group, Institut Universitari de l Audiovisual Universitat

More information

Towards Music Performer Recognition Using Timbre Features

Towards Music Performer Recognition Using Timbre Features Proceedings of the 3 rd International Conference of Students of Systematic Musicology, Cambridge, UK, September3-5, 00 Towards Music Performer Recognition Using Timbre Features Magdalena Chudy Centre for

More information

PULSE-DEPENDENT ANALYSES OF PERCUSSIVE MUSIC

PULSE-DEPENDENT ANALYSES OF PERCUSSIVE MUSIC PULSE-DEPENDENT ANALYSES OF PERCUSSIVE MUSIC FABIEN GOUYON, PERFECTO HERRERA, PEDRO CANO IUA-Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain fgouyon@iua.upf.es, pherrera@iua.upf.es,

More information

OBJECTIVE EVALUATION OF A MELODY EXTRACTOR FOR NORTH INDIAN CLASSICAL VOCAL PERFORMANCES

OBJECTIVE EVALUATION OF A MELODY EXTRACTOR FOR NORTH INDIAN CLASSICAL VOCAL PERFORMANCES OBJECTIVE EVALUATION OF A MELODY EXTRACTOR FOR NORTH INDIAN CLASSICAL VOCAL PERFORMANCES Vishweshwara Rao and Preeti Rao Digital Audio Processing Lab, Electrical Engineering Department, IIT-Bombay, Powai,

More information

Music Genre Classification and Variance Comparison on Number of Genres

Music Genre Classification and Variance Comparison on Number of Genres Music Genre Classification and Variance Comparison on Number of Genres Miguel Francisco, miguelf@stanford.edu Dong Myung Kim, dmk8265@stanford.edu 1 Abstract In this project we apply machine learning techniques

More information

Polyphonic music transcription through dynamic networks and spectral pattern identification

Polyphonic music transcription through dynamic networks and spectral pattern identification Polyphonic music transcription through dynamic networks and spectral pattern identification Antonio Pertusa and José M. Iñesta Departamento de Lenguajes y Sistemas Informáticos Universidad de Alicante,

More information

Neural Network for Music Instrument Identi cation

Neural Network for Music Instrument Identi cation Neural Network for Music Instrument Identi cation Zhiwen Zhang(MSE), Hanze Tu(CCRMA), Yuan Li(CCRMA) SUN ID: zhiwen, hanze, yuanli92 Abstract - In the context of music, instrument identi cation would contribute

More information

Registration Reference Book

Registration Reference Book Exploring the new MUSIC ATELIER Registration Reference Book Index Chapter 1. The history of the organ 6 The difference between the organ and the piano 6 The continued evolution of the organ 7 The attraction

More information

Automatic characterization of ornamentation from bassoon recordings for expressive synthesis

Automatic characterization of ornamentation from bassoon recordings for expressive synthesis Automatic characterization of ornamentation from bassoon recordings for expressive synthesis Montserrat Puiggròs, Emilia Gómez, Rafael Ramírez, Xavier Serra Music technology Group Universitat Pompeu Fabra

More information

The Research of Controlling Loudness in the Timbre Subjective Perception Experiment of Sheng

The Research of Controlling Loudness in the Timbre Subjective Perception Experiment of Sheng The Research of Controlling Loudness in the Timbre Subjective Perception Experiment of Sheng S. Zhu, P. Ji, W. Kuang and J. Yang Institute of Acoustics, CAS, O.21, Bei-Si-huan-Xi Road, 100190 Beijing,

More information

AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY

AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY Eugene Mikyung Kim Department of Music Technology, Korea National University of Arts eugene@u.northwestern.edu ABSTRACT

More information

Musical Instrument Identification based on F0-dependent Multivariate Normal Distribution

Musical Instrument Identification based on F0-dependent Multivariate Normal Distribution Musical Instrument Identification based on F0-dependent Multivariate Normal Distribution Tetsuro Kitahara* Masataka Goto** Hiroshi G. Okuno* *Grad. Sch l of Informatics, Kyoto Univ. **PRESTO JST / Nat

More information

Keywords Separation of sound, percussive instruments, non-percussive instruments, flexible audio source separation toolbox

Keywords Separation of sound, percussive instruments, non-percussive instruments, flexible audio source separation toolbox Volume 4, Issue 4, April 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Investigation

More information