

Mining for Scalar Representations of Emotions in Music Databases

by Rory Adrian Lewis

A dissertation proposal submitted to the faculty of The University of North Carolina at Charlotte in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Information Technology.

Charlotte, 2007

Approved by: Dr. Zbigniew Ras, Dr. Mirsad Hadzikadic, Dr. Moutaz J. Khouja, Dr. Tiffany M. Barnes

© 2007 Rory Adrian Lewis. All rights reserved.

ABSTRACT

RORY ADRIAN LEWIS. Mining for Scalar Representations of Emotions in Music Databases. (Under the direction of DR. ZBIGNIEW RAS)

This dissertation examines an important issue in the Music Information Retrieval domain: codifying the classification of harmonic pitches for the purpose of discovering action rules that interact with scalar music theory. Our intent is twofold: (1) codify scalar music theory for classification rules mining, in order to build a system for automatic indexing of music by scale, region, genre, and emotion; (2) use action rules mining to let developers manipulate the genre and tension of a composition returned by a search while retaining the bulk of the original score. We propose a categorization system for music based upon classification rules and action rules, together with a procedure that derives the categorization from temporal and spectral attributes of the signal, yielding a structure conducive to mining emotions in music.

ACKNOWLEDGMENTS

I would like to express my sincere gratitude to my advisor, Dr. Zbigniew W. Ras, for the opportunity to explore the field of data mining. His support and encouragement throughout my Ph.D. were invaluable. Without his insightful comments and guidance, the study of mining emotions in music databases and the completion of my Ph.D. dissertation would have been impossible. I would also like to acknowledge Dr. Mirsad Hadzikadic, Dr. Tiffany Barnes and Dr. Moutaz Khouja for their support as my professors and committee members. This thesis is based in part upon work supported by the National Science Foundation under Grant Number IIS. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation.

TABLE OF CONTENTS

CHAPTER 1: INTRODUCTION
    Motivation and Problem Statement
    Road Map
    Prerequisite Knowledge
    Designing the Database
    Music Information Retrieval
    Empirically Categorizing Emotions
    Audio Data Parameterization
    Data Labeling
    A 3-D Domain for Data Labeling
    Modeling a Representative Domain for Emotions

CHAPTER 2: PREREQUISITE KNOWLEDGE - Part I: KDD
    Constructing Sound Objects

CHAPTER 3: PREREQUISITE KNOWLEDGE - Part II: Music
    INTRODUCTION TO MUSIC THEORY AND KDD
    Musical Notes
    The Perfect Fifth
    Major Third
    KDD and The Physics of Music
    FOURIER AND THE HUMAN EAR
    Music Theory in a Nutshell
    Music Waveforms
    The Blues Scale
    Representing Scales
    Balanced Trees: A Structure for Musical Scales
    Balanced Scaler/Emotion Tree
    Neuroendocrinology Scalar Emotions
    Node T:3:1 The Augmented Scale
    Node T:3:2:1 The Blues Major Scale
    Node 1:3:2:2 The Minored Pentatonic and Blues Scale
    Conclusion of Neuroendocrinology

CHAPTER 4: PREREQUISITE KNOWLEDGE - Part III: Digital Music
    Associating the Emotional 3D Domain with Music Instruments
    Initial Mining of Rules for Music Instruments

CHAPTER 5: PREREQUISITE KNOWLEDGE - Part IV: MPEG-7 and Timbre Descriptors
    Timbre Descriptors

CHAPTER 6: MINING EMOTIONS IN MIR
    Instrument Classification
    A three-level empirical tree
    LogAttackTime (LAT)
    AudioHarmonicityType (HRM)
    Sustainability (S)
    Experiments
    Testing: HRM, LAT, S, with HS
    HRM, LAT, S, with HS
    HRM, LAT, S, with Instruments
    Resulting Tree
    Testing: Rough Sets

CHAPTER 7: MUSIC SIGNAL SEPARATING
    Introducing a New Approach for Music based on KDD
    Polyphonic Pitches and Instrumentations
    Signal Analysis
    Our BSS Experiments
    Initial Blind Signal Experiments
    Results
    Experiment 1: Classification of original sounds
    Experiment 2: Classification of echoed sounds
    Experiment 3: Classification of subtracted piano
    Experiment 4: Classification of subtracted Clarinet (10 folds)
    Second Blind Signal Experiments
    Noise Variations
    Creating a Real-World Training and Testing Sets
    Experimental Parameters
    MPEG-7 features
    Experimental Procedures
    Conclusion
    ICA Timeline

CHAPTER 8: MINING EMOTIONS FROM ACTION RULES
    Western Music Pitch and Scales
    Tension in Music
    Information Systems and Action Rules
    KD-based Scalar Music Theory
    Experiments
    Identifying the Scale
    Stage 1 of 3: Initial 100% matches
    Stage 2 of 3: Reducing the Search Space of Distance Algorithm
    Bar Weights
    Phrase Weights
    Total Weights
    Stage 3 of 3: Calculating the Distance Between Jump Sequences
    Query Answering and Action Rules in MIR Systems
    Action Rules to Manipulate Emotions of the Song
    Experiment Results

CHAPTER 9: CONCLUSION
    Future Research

List of Figures

    Hevner's 2-D Arousal vs. Appraisal Plane
    Tato's 3D Domain For Emotions
    Adjective Circle according to K. Hevner
    Lewis 3D Domain For Emotions: Cube
    Lewis 3D Domain For Emotions: Axis
    Lewis 3D Domain For Emotions: XY
    Lewis 3D Domain For Emotions: Z Planes
    Lewis 3D Domain For Emotions: X0Z, 0YZ
    Emotional 3D Domain For Emotions: Brave
    Emotional 3D Domain For Four Emotions
    Constructing Sound Objects
    Marimba: Single Stroke and Crescendo Roll
    Violin Sound Objects
    Single Stroke v. Crescendo Rolls
    Naming the Tag
    Constructing First of the Seven Files
    Sound Forge Includes the SFK Files
    Transferring Files
    Transferring Files
    Notes of the Piano
    Perfect Fifth
    Major Third
    Fourier: Fundamental
    Fourier: Next Overtone
    Fourier: 9th Overtone
    Theoretical Procedure: Subtracting
    Illustration of log-attack time
    Five Levels of Sustainability
    C4.5 results testing Sachs-Hornbostel-level HRM, LAT and S descriptors
    A noisy cocktail party
    C 44,100 Hz, 16 bit, stereo
    A Bb 44,100 Hz, 16 bit, stereo
    Piano and 44,100 Hz, 16 bit, stereo - No Noise
    Piano and 44,100 Hz, 16 bit, stereo - 01 Noise
    Piano and 44,100 Hz, 16 bit, stereo - 02 Noise
    Piano and 44,100 Hz, 16 bit, stereo - 02 Noise
    Piano and 44,100 Hz, 16 bit, stereo - 03 Noise
    Theoretical Procedure: Subtracting
    A Gaussian function is used because it reflects the tolerance of most listeners for slight mistuning of chords by effectively broadening the regions of perceived tension and resolution, but the exact shape of the curve remains empirically uncertain. Here and elsewhere we assume twelve-tone equitempered tuning (for further details see [Cook and Fujisawa, 2006])
    Example score of a Pentatonic Minor Scale played in the key of C
    Example score of a Pentatonic Minor Scale played in the key of C
    Bar example of Pentatonic Minor Scale played in the key of C

List of Tables

    The five contributions to MIR presented in this thesis: mining emotions in the current state-of-the-art of MIR; music information separation and splitting for polyphonic MIR; music information manipulation using Action Rules; mining emotions from Action Rules; and finally, merging the mining of emotions in MIR and Action Rules
    Adjective-Based Labeling
    3-D Emotion Domain
    Intervals
    Happy harmonics of a C Major chord
    Darker harmonics of a C Minor chord
    C Major: The header represents the Tonic at 1 and the following notes of a scale. For a Major scale in the key of C, the tonic is C, the 2nd note, 2m, is D, the 3rd note, 3m, is E, and so forth
    Generic Major: The header represents the Tonic at 1 and the following notes of a scale. For a Generic Major, meaning it could be in any key, the distance from the tonic to itself is zero, the distance from the tonic to the second note is 2 semitones, the second note to the third note is 2 semitones, the third note to the fourth note is 1, and so forth
    Scalar Charts Depicting Significant Musical Scales: Node T:3:1 - The Augmented Scale, Node T:3:2:1 - The Blues Major, Node 1:3:2:2 - The Minored Pentatonic and Blues Scale
    Normalizing 4 representations
    3-D Emotion Domain
    Calculating rules for S with a minimum of 90% confidence with support of no less than 6 using RSES
    Calculating rules for Articulation with a minimum of 90% confidence with support of no less than 6 using RSES
    Experiment 1
    Experiment 2 with noise
    Experiment 3 with noise
    Experiment 4 with noise
    Training Set: Sample Sound Mixes: Training Set Preparation
    Sample Sound Mixes: Testing Set Preparation
    Sample Training Set: Resultant Sounds by Sound Separation
    Sample Testing Set: Resultant Sounds by Sound Separation
    Overall Accuracy Results: Tree J8, Logistic, Local Weighted Learning, Bayesian Network
    Individual Results
    Two classification rules extracted from S
    Representation of a Pentatonic Minor Scale
    Possible Representative Jump Sequences from All Optional Roots
    Possible Representative Jump Sequences for the Input Sequence
    Basic Score Classification (BSC) Database
    Rules Extracted from BSC Database by LMC

    Step 1: Computing duration for each note of each song, where the pitch is in the tuple labeled Note and the duration is in the tuple labeled Duration, and each frame's duration unit is 0.12 seconds. The numerical value associated with duration shows how many times each note occurs in each song and its total amount of time in duration units. For example, the note a scored a total of 5 in Blue Sky, meaning that a is heard for 5 * 0.12 seconds, or 0.6 seconds, in total
    Mining All Possible Scales and Cuts in Blue Sky: Phrase 1 took Pentatonic Major, but it is in e. In Phrase 2 the original is fewer than 3 notes, so we ignore it. In Phrases 3 and 6 there is no 100% match, so we pick up c ef g c with 3224 for a distance search according to the highest weighted possible scale in c, which is Pentatonic Major. In Phrase 4, c ef g bc, which was a Pentatonic Minor, is accounted for, but we still pick it up because it was a Minor, not Major, Pentatonic. In Phrase 5, c ef g with 321, which was a Blues Major, is accounted for, but we still pick it up because it was not Major Pentatonic
    Mining All Possible Scales and Cuts in Blue Sky: note that unlike Table 8.8, which shows cuts, this table illustrates the process before the cuts divide each phrase
    Final Results: In Blue Sky a c Pentatonic Major has a score of 8, making it the most likely scale and key. This is correct. In Nobody Loves You an a Balinese has a score of 8, making it the most likely scale and key. This is correct based on the data, but the input data was polluted because the input system could not correctly assimilate polyphonic notes, which are in abundance in this piece of music. The correct scales, to future MIR systems that can assimilate polyphonic sounds, would be a mixture of the c Spanish 8-Tone scale and the c Major scale
    Action Rules application on the Pentatonic Major versus the Augmented for each phrase

CHAPTER 1: INTRODUCTION

It is undisputed that throughout time, and across any continent or culture, mothers have always been and will always be able to sing a lullaby to their babies and make them fall asleep. How is this? It is also well established that at any military funeral in Western society, one will see more people crying when the trumpet plays Taps than at any other time during the funeral. Why is this? Would it be the same if we played Hello Dolly? Probably not, but why? Would it be the same if we played Taps very fast on a xylophone? Again, probably not, but why are that horn and that song making so many people cry, even the ones holding in their tears? It is also known that children, be they in Africa, Taiwan, Bahrain, New York City or wherever you choose, will shriek with laughter when that boing sound resonates as Fred Flintstone smashes up against a wall. How come they don't cry when they hear that boing? These are the questions the authors pursue by constructing two types of music databases and mining them for knowledge that describes the aforementioned. It is known in the fields of psychology and neuro-endocrinology that data from neurotransmitters measured in laboratories prove that certain music scales evoke measurable sensory sensations and emotions [Pavel, Valentinuzzi and Arias, Peretz and Hyde]. On the point of emotional analysis, the literature contains a varied array of papers concerning emotions in music [Sevgen, 1983, McClellan, 1966]. Furthermore, it is understood in the field of Music Information Retrieval (MIR) that if a machine, given a polyphonic musical waveform, could recognize all the instruments and the correlating notes each instrument played, then, if given the key, it could calculate the scale of the

music [Shmulevich et al., ]; if given the scale, it could calculate the key. In summation, if we can find the scale and/or key of a piece of music then we can also mine for emotions. The obstacles preventing MIR methods from successfully mining emotions in music are weak Blind Source Separation (BSS) of musical instruments in a polyphonic domain, imprecise instrument identification, and the inability to find a scale unless given the key or vice-versa. This thesis includes work we presented on BSS [Lewis et al., a, Lewis et al., b] and on processes of mining music in a music database (see [Wieczorkowska et al., 2005] for details) set in a non-Hornbostel hierarchical manner (see [Lewis and Wieczorkowska, ] and [Lewis et al., c] for details).

1.1 Motivation and Problem Statement

Data mining is historically synonymous with objective searches. In recent times subjective search methodologies have emerged in the art of data mining. This thesis embarks into the highly subjective domain of mining emotions. Emotions can be expressed in many forms and degrees. Music is one of the most popular means by which humans express emotions. However, the perception of music is incredibly subjective. The pivotal question of this thesis is: how does one mine emotions in an objective form from an unlabeled music database and then manipulate the music to empirically match the mined emotion? Accordingly, this thesis first presents the advances it has made in the collecting and labeling of data for the purpose of using knowledge discovery to find rules for discovering emotions in music audio files. Secondly, it describes how, once music is mined for a specific emotion, our machine uses action rules to manipulate the music to further satisfy the emotional state being sought by the user. The machine and methodology this thesis presents are referred to by the acronym MIRAI, which stands for Music Information Retrieval Automatic Indexing. MIRAI thus far has a database

of more than 4000 sounds created, labeled and converted into MPEG-7 descriptors explicitly for the aforementioned adjectives.

1.2 Road Map

The thesis is divided into three sections: 1) prerequisite knowledge, 2) building the database and 3) the core motivation of this thesis, a novel music information retrieval methodology.

Prerequisite Knowledge

To appreciate the advances made in Music Information Retrieval (hereinafter MIR) presented herein, it is recommended that one be familiar with recent advances made in the arts of Knowledge Discovery, Multimedia in Databases, Data Mining, Music Theory and MPEG-7 technology. Accordingly, Chapter 2 presents both a background to and advances made in Knowledge Discovery in Databases (hereinafter KDD). Specifically, Chapter 2 focuses on the role KDD has in multimedia databases. Chapter 3 presents a background of music theory and its relationship with the physics of music. We then describe how our approach to Knowledge Discovery in music databases works in correlation with music theory and the physics of music. With this said, we present the novel approach we've developed, which combines the diverse arts of KDD with music theory and the physics of music. Chapter 4 presents advances in MPEG-7, an essential tool necessary to accommodate the core motivation of this dissertation.

Designing the Database

With the aforementioned knowledge base presented in Chapters 2 through 4, Chapter 5 illustrates the reasons why the MIRAI database is designed in the unique manner presented in this thesis. First we illustrate classical database design. Second we point out in detail why classical database design would not work for the advances in MIR presented. Finally we present how, in the process of solving the problem of trying to contort classical database design to meet the needs of our advanced MIR

methodology, we were led to the current database design. This section describes the database design both as a methodology and in the actual SQL, Java and .asp code we used in MIRAI.

Music Information Retrieval

As seen in Table 1.1, the core of this dissertation involves five contributions to the art of Music Information Retrieval, which are presented in Chapters 6 through 10 respectively. These contributions are: mining emotions in the current state-of-the-art of MIR; music information separation and splitting for polyphonic MIR; music information manipulation using action rules; mining emotions from action rules; and finally, merging the mining of emotions in MIR with action rules.

1.3 Empirically Categorizing Emotions

It is known that listeners respond emotionally to music [Vink, 2001], and that music may intensify and change emotional states [Sloboda, 1996]. One can debate whether feelings experienced in relation to music are actual emotional states, since in general psychology emotions are currently described as specific process-oriented response behaviors, i.e. directed at something (a circumstance, person, etc.). Thus, musical emotions are difficult to define, and the term emotion in the context of music listening is actually still undefined. Moreover, the intensity of such emotion is difficult to evaluate, especially since the musical context frequently lacks the real-life influences that induce emotions.

However, music can be experienced as frightening or threatening, even if the user has control over it and can, for instance, turn the music off. Furthermore, it is also known that music can be defined in various ways, for instance as an artistic form of auditory communication incorporating instrumental or vocal tones in a structured and continuous manner, or as the art of combining sounds of voices or instruments to achieve beauty of form and expression of emotion [Cross, 2001]. Therefore, music is inseparably related to emotions. Musical structure itself communicates emotions, and synthesized music also aims at expressive performance [Juslin and Sloboda, 2001], [de Mantaras and Arcos, 2001]. The experience of music listening can be considered within three levels of human emotion [Huron, 1997]: the autonomic level, the denotative (connotative) level, and the interpretive (critical) level. According to [Lavy, 2001], music is heard:

- as sound. The constant monitoring of auditory stimuli does not switch off when people listen to music; like any other stimulus in the auditory environment, music is monitored and analyzed.

- as human utterance. Humans have an ability to communicate and detect emotion in the contours and timbres of vocal utterances; a musical listening experience does not annihilate this ability.

- in context. Music is always heard within the context of knowledge and environment, which can contribute to an emotional experience.

- as narrative. Listening to music involves the integration of sounds, utterances and context into a dynamic, coherent experience. Such integration is underpinned by generic narrative processes (not specific to music listening).

Emotions can be characterized in appraisal and arousal components, as shown in [Tato et al., 2002]. Intense emotions are accompanied by increased levels of physiological arousal. Music-induced emotions are sometimes described as mood states, or feelings. Some elements of music, such as a change of melodic line or rhythm, create tension toward a certain climax, and expectations about the future development of the music. In fact, as will be seen in this thesis, one's expectation of an anticipated note will be crucial. It will be shown that interruptions of expectations induce arousal and, if the expectations are fulfilled, the emotional release and relaxation upon resolution is proportional to the build-up of suspense and tension, especially for non-musician listeners [Rickard, 2004]. Conversely, trained listeners usually prefer more complex music. However, the majority of people experience arousal when they can predict the next note or word in even the simplest of songs. The issue is: how does one label emotions so that each emotion can be categorized into a data table? Although music is a delicate subject for scientific experiments, research has already been performed on automatic composition in a given style [Pachet, 2004], discovering principles of expressive music performance from real recordings [Widmer, 2003], and labeling music files with metadata [Pachet, 2005]. Also, research on recognizing emotions in audio files has been performed on speech data [Dellaert et al., 1996], [Tato et al., 2002]. Emotions communicated in speech are quite clear. However, in experiments described by Dellaert et al. in [Dellaert et al., 1996], human listeners recognized emotions in speech with about 80% correctness, in experiments with over 1000 utterances from different speakers, classified into 4 categories: happy, sad, anger, and fear. The results obtained in machine classification were very similar, also reaching 80% correctness. [Tato et al., 2002] also obtained a recognition rate

approaching 80% for 3 classes regarding levels of activation: high (angry, happy), medium (neutral), and low (sad, bored); see Figure 1.2. A database of about 2800 utterances was used in these experiments. Research on discovering emotions from music audio data has also recently been performed [Li and Ogihara, 2003] on detecting emotions in music, using 499 sound files representing 13 classes, labeled by a single subject. The accuracy for particular classes ranged from about 50% to 80%. Since emotions in music are more difficult to discover than in speech, and even the listener labeling the data reported difficulties with performing the classification task, the obtained results are very good.

1.4 Audio Data Parameterization

In the case of music audio data, other descriptors are used, as seen in articles by [Peeters and Rodet, 2002], [Tzanetakis et al., 2000], [Wieczorkowska et al., 2003a]. These features include the structure of the spectrum, time-domain features, and also time-frequency descriptions. Since research on automatic detection of emotions in music is very recent, there is no significant comparison of descriptor sets and their performance for this purpose. However, Li and Ogihara applied parameters provided in [Tzanetakis et al., 2000], describing timbral texture features, rhythmic content features, and pitch content features, where the dimension of the final feature vector was 30 [Li and Ogihara, 2003].

Data Labeling

One of the difficulties in experiments on recognition of emotions in music is the labeling of the data. The emotions can be described in various ways. One of the possibilities is presented in Figure 1.1, proposed by Hevner [Hevner, 1936]. This labeling consists of 8 classes, although not all adjectives in a single group are synonyms; see for instance pathetic and dark in class 2. In 2002 at the 7th International Conference on Spoken Language Processing, Tato

presented a new method of labeling data representing emotions in a 3-dimensional space. The research detected emotions in speech along an activation dimension using the following adjectives: angry, happy, neutral, sad, bored. The paper acknowledged that a 2-dimensional space may describe an amount of activation and quality, or conversely, an amount of arousal and valence (pleasure) [Tato et al., 2002], as mentioned in Section 1.3. The 3-dimensional space considers 3 categories: pleasure (evaluation), arousal, and domination (power). Arousal describes the intensity of the emotion, ranging from passive to active. Pleasure describes how pleasant the perceived feeling is, and it ranges from negative to positive values. Power relates to the sense of control over the emotion. Examples of emotions in the 3-dimensional space can be observed in Figure 1.2.

Figure 1.1: Hevner's 2-D plane, presented in 1936, represents emotions in an arousal vs. appraisal plane. Arousal values range from very passive to very active; appraisal values range from very negative to very positive.

In the early stages of our research we studied Li and Ogihara's adjective-based labeling using 13 classes [Li and Ogihara, 2003] (see Table 4.1), where each class is labeled by one, two, or three adjectives. These groups are based on Hevner adjectives redefined by Farnsworth, supplemented with 3 additional classes. To see for ourselves how this worked, we conducted our own research on this in 2005 in Extracting Emotions from Music Data [Wieczorkowska et al., 2004a].

Figure 1.2: Tato's 3D domain, presented in 2002 at the 7th International Conference on Spoken Language Processing.

Figure 1.3: Adjective circle according to K. Hevner.

Table 1.2: Adjective-Based Labeling: Li and Ogihara's adjective-based labeling using 13 classes of redefined Hevner adjectives, supplemented with 3 additional classes:
    1. cheerful, gay, happy
    2. fanciful, light
    3. delicate, graceful
    4. dreamy, leisurely
    5. longing, pathetic
    6. dark, depressing
    7. dramatic, emphatic
    8. agitated, exciting
    9. frustrated
    10. sacred, spiritual
    11. mysterious, spooky
    12. passionate
    13. bluesy

Altogether, the following adjectives were used in this research: cheerful, gay, happy, fanciful, light, delicate, graceful, dreamy, leisurely, longing, pathetic, dark, depressing, sacred, spiritual, dramatic, emphatic, agitated, exciting, frustrated, mysterious, spooky, passionate, and bluesy. Initially the adjectives helped distinguish songs that fell into more than one of the previously used Hevner-based categories. However, it became clear that the adjective-based categorization lacked an environment conducive to automatic indexing of music information, and it also was not comparable to empirically distinguishing emotions. The two aforementioned problems led us back to Hevner, where it was decided to re-focus upon Hevner by incorporating Hevner's expanded adjective-based labeling (see Figure 1.3) as one plane in a multidimensional environment. This thought came about because, as mentioned above in 1.4.4, research showed that emotions can also be represented in a 2- or 3-dimensional space, thus allowing labeling along the chosen axes. (See Chapter 5.1 on page 51 for details of the timbre descriptors we used for the labeling.) We used single labeling of classes by a single subject because this allowed us to check, exactly, the quality of the parameterization chosen as a tool for finding dependencies between subjective and objective audio descriptions. As we did this we also performed multiple labeling and multi-subject assessments in order to check the consistency of perceiving emotions from subject to subject. The data we collected contained a few dozen examples for each of the 13 classes, labeled with adjectives. One,

two, or three adjectives were used for each class, since such labeling may be more informative for some subjects. The final collection consisted of more than 300 pieces (whole songs or other pieces). After parameterizing the audio data and calculating the feature vectors for each piece, we tested automatic classification of emotions using the k-NN algorithm. Our work therefore yielded measurable outcomes, proving its usefulness; for details see 5.1, [Wieczorkowska et al., 2004a]. The point is that these experiments proved that the labeling could work. Furthermore, they solidified the notion that more planes would only help make the system more objective.

Figure 1.4: Lewis 3D Domain represents an environment to include all emotions. The x-axis represents the degree of happiness, the y-axis represents the degree of energy and the z-axis represents the level of confidence.

A 3-D Domain for Data Labeling

As it became evident to us that data labeling can indeed be represented on more than one plane, it occurred to us that one can experience an emotion that can be described

in different ways. For example, one can be happy and meditative, or happy and very energetic. So we began by representing, à la Hevner (see Figure 1.1), the x-axis at -5 as the very unhappy state, neutral at 0 and very happy at +5. Accordingly, we decided to make the y-axis represent the degree of energy and the z-axis represent the level of confidence; see Figure 1.5. Looking at Figure 1.4, one will note that the three axes are represented in the order x, then y, then z. The intersection of all three axes is 0,0,0. The most negative value is -5,-5,-5, which is located at the left (x=-5), bottom (y=-5), forward (z=-5) corner. Likewise, the most positive place is represented as 5,5,5, which is located at the right (x=5), top (y=5), back (z=5) corner. It will also be important when reading this thesis to be familiar with the planes of x, y and z. We denote a plane by stating which axis is zero and representing the other two axes by name. For example, Figure 1.6 shows the xy0 plane, because every point shown lies where z=0; therefore we name this the xy0 plane. Here -5,5,0 = energetic and depressed, 5,5,0 = energetic and happy, -5,-5,0 = relaxed and depressed, and 5,-5,0 = relaxed and happy. Staying with the points of Figure 1.6 but moving along the z-axis, each point is expounded upon, since in Figure 1.6 the z-coordinate was 0. As illustrated in Figure 1.7, each point then has 3 labels. For example, -5,5,-5 = energetic, depressed and terrified; 5,5,-5 = energetic, happy and terrified; -5,-5,-5 = relaxed, depressed and terrified; 5,-5,-5 = relaxed, happy and scared; -5,5,5 = energetic, depressed and brave; 5,5,5 = energetic, happy and brave; -5,-5,5 = relaxed, depressed and brave; and 5,-5,5 = relaxed, happy and brave. Again, as illustrated in Figure 1.8, each point also has labels, such as -5,0,-5 = depressed and terrified; 5,0,-5 = terrified and happy; 0,5,-5 = energetic and terrified; 0,-5,-5 = relaxed and terrified; -5,0,5 = depressed and brave; 5,0,5 = happy and brave; 0,5,5 = energetic and brave; and 0,-5,5 = relaxed and brave.

Figure 1.5: Lewis 3D Domain represents an environment to include all emotions. The x-axis represents the degree of happiness, the y-axis represents the degree of energy and the z-axis represents the level of confidence.

Figure 1.6: Lewis 3D Domain showing the xy0 plane: -5,5,0 = energetic and depressed, 5,5,0 = energetic and happy, -5,-5,0 = relaxed and depressed, and 5,-5,0 = relaxed and happy.

Figure 1.7: Lewis 3D Domain showing the z planes: -5,5,-5 = energetic, depressed and terrified; 5,5,-5 = energetic, happy and terrified; -5,-5,-5 = relaxed, depressed and terrified; 5,-5,-5 = relaxed, happy and scared; -5,5,5 = energetic, depressed and brave; 5,5,5 = energetic, happy and brave; -5,-5,5 = relaxed, depressed and brave; and 5,-5,5 = relaxed, happy and brave.

Figure 1.8: Lewis 3D Domain showing the x0z and 0yz planes: -5,0,-5 = depressed and terrified; 5,0,-5 = terrified and happy; 0,5,-5 = energetic and terrified; 0,-5,-5 = relaxed and terrified; -5,0,5 = depressed and brave; 5,0,5 = happy and brave; 0,5,5 = energetic and brave; and 0,-5,5 = relaxed and brave.
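To make the coordinate and plane conventions above concrete, the following minimal sketch models a point in the emotion cube. It is written in Java (the implementation language named later for MIRAI), but the class and method names are our own illustrative choices and are not code from MIRAI itself.

// Minimal sketch of the 3-D emotion domain described above.
public class EmotionCube {
    // Each axis ranges over the whole numbers -5..5:
    // x = happiness, y = energy, z = confidence.
    final int x, y, z;

    EmotionCube(int x, int y, int z) {
        if (Math.abs(x) > 5 || Math.abs(y) > 5 || Math.abs(z) > 5)
            throw new IllegalArgumentException("coordinates must lie in [-5, 5]");
        this.x = x; this.y = y; this.z = z;
    }

    // Names the plane a point lies on using the convention in the text:
    // the axis fixed at zero is written as 0, the free axes by name.
    String plane() {
        if (z == 0) return "xy0";
        if (y == 0) return "x0z";
        if (x == 0) return "0yz";
        return "off-plane point";
    }

    public static void main(String[] args) {
        EmotionCube energeticHappy = new EmotionCube(5, 5, 0);     // on the xy0 plane
        EmotionCube relaxedHappyBrave = new EmotionCube(5, -5, 5); // a corner of the cube
        System.out.println(energeticHappy.plane());     // prints: xy0
        System.out.println(relaxedHappyBrave.plane());  // prints: off-plane point
    }
}

Under the labels used in the figures above, (5, 5, 0) reads as energetic and happy, while (5, -5, 5) reads as relaxed, happy and brave.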

Modeling a Representative Domain for Emotions

In order to represent an emotion for each sound that an instrument can make, the authors chose 33 emotions often used by musicians to describe sounds emanating from instruments. Rather than focusing on where each emotion would sit in the 3-D domain, the placement hinged on where that emotion would not lie. For example, as illustrated in Table 1.3, the emotion called brave is denoted by *.*.3. When one is feeling brave there is no definitive state of happiness, because one could be brave and happily going into battle. Conversely, a father may just have seen a murderer murder his daughter, and even though he is terribly depressed, he is walking towards the murderer of his daughter, brave. Considering that the x-axis represents happiness (5.*.*) and depression (-5.*.*), it is clear that, as illustrated herein, since a person could be anywhere from -5.*.* to 5.*.*, brave cannot be constrained to any particular place on the x-axis. Similarly, as far as the y-axis is concerned, it is easy to imagine a brave warrior battling it out on the battlefield, full of energy. Clearly an emotional state of *.5.* can house the state of brave. But what about *.-5.*? Could we say that just because one is very relaxed one cannot possibly be brave? Consider the same warrior who is battling it out on the battlefield in the emotional state of *.5.*: the night before the battle he went to bed and fell asleep. During this period he was relaxed, confident and eagerly awaiting the morning when he could slaughter the enemy. Can one say that just because his emotional state at that time, going to sleep, is *.-5.*, it is impossible for him to be brave? Of course not. Hence the emotional state of brave not only has no bearing on the x-axis, it also has no bearing on the y-axis. However, looking at the z-axis, which runs from confident (*.*.5) to terrified (*.*.-5), one can automatically sense that cowards run away in the heat of battle, while the brave are confident and stay fighting even when the going gets rough; a brave warrior will persist and overcome. Clearly, the space housing the emotional state brave cannot be in the *.*.-5 range, because that is the domain of the cowards,

the faint of heart who run away in battle. What about *.*.0? Could one be equally terrified and equally brave? No: clearly brave is canceled by terrified and equals apathy. The authors have decided to only use whole numbers. Somewhere between *.*.0 and *.*.5 a person becomes more confident than apathetic; it will have to be more than half-way between 0 and 5. Because 2.5 cannot be represented, we pick *.*.3 and up to *.*.5 as the domain to house the emotional state brave. The authors' convention for this is simply *.*.3. See Figure 1.9.

Figure 1.9: Emotional 3D Domain showing the *.*.3 domain where the emotion brave exists.

Figure 1.10 illustrates the domains where four more emotions reside, namely: Somber, Melancholy (-2,-1,1), Glee (5.5.3) and Angst (-2.*.-2). Somber is -1.*.*, a little on the other side of happy; *.-1.*, a little on the negative side of jumping around with energy; and a little on the negative side of feeling particularly brave and very confident. Melancholy is similar to somber, but there is a little more of a jump in one's energy; yet not being with their passion takes it a notch down, though not over the -2.5 mark, making a value of -2. Glee is not quite ecstatic; it sits just before ecstasy. Angst is not particularly joyful, so it goes on the negative side of the x-axis, but not quite in the depressed state: -2.*.*. Insofar as the y-axis is concerned, angst can be felt while feeling and being energetic, or when lying still in bed, so the y-axis can be anything. Angst does not convey confidence, so it is on the negative side of confident,

3-D Emotion Domain

    Emotion        X    Y    Z
    Humility       *    *    1
    Brave          *    *    3
    Apprehension
    Impatience
    Desire
    Patience
    Whimsfull
    Cautiousness
    Somber
    Mellow
    Determination
    Amusement      2    *    2
    Feel-Good      2    *    1
    Angst         -2    *   -2
    Moody          2    1    *
    Melancholy    -2   -1    1
    Kindness
    Acceptance
    Comfort
    Confidence
    Irritability  -2    3    *
    Paranoia
    Calmness      -3    *    *
    Gratitude      3    *    1
    Free           3    3    *
    Zest
    Happiness      4    *    1
    Gladness
    Joy
    Delight
    Depressive
    Elation
    Glee           5    5    3
    Ecstacy

Table 1.3: 3-D Emotion Domain: representation of 33 emotions with their corresponding placement in a 3-D domain, where the X, Y and Z columns correspond to the happiness, energy and confidence axes and * denotes a wildcard.

but not over the petrified threshold. Hence angst is -2.*.-2.

Figure 1.10: Emotional 3D Domain showing Somber, Melancholy (-2,-1,1), Glee (5.5.3) and Angst (-2.*.-2).
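The wildcard placements in Table 1.3 can be read as region tests over the cube. The sketch below encodes one such reading in Java, under the assumption (inferred from the discussion of brave above) that a literal value means "at least that far along the axis in the direction of its sign", so that *.*.3 admits any z from 3 up to 5. The class, its fields and this interpretation are our own, not MIRAI code.

// A sketch of the wildcard convention used in Table 1.3.
public class EmotionRegion {
    final Integer x, y, z;   // null stands for the wildcard '*'
    final String name;

    EmotionRegion(String name, Integer x, Integer y, Integer z) {
        this.name = name; this.x = x; this.y = y; this.z = z;
    }

    // A literal bound v is read as "at least v toward the extreme of its sign"
    // (assumption): v >= 0 means value >= v, v < 0 means value <= v.
    private static boolean axisMatches(Integer bound, int value) {
        if (bound == null) return true;   // '*' places no constraint
        return bound >= 0 ? value >= bound : value <= bound;
    }

    boolean contains(int px, int py, int pz) {
        return axisMatches(x, px) && axisMatches(y, py) && axisMatches(z, pz);
    }

    public static void main(String[] args) {
        EmotionRegion brave = new EmotionRegion("Brave", null, null, 3);  // *.*.3
        EmotionRegion angst = new EmotionRegion("Angst", -2, null, -2);   // -2.*.-2
        System.out.println(brave.contains(-5, 5, 4));  // true: depressed, energetic, yet brave
        System.out.println(brave.contains(0, 0, 0));   // false: apathy, not bravery
        System.out.println(angst.contains(-3, 0, -3)); // true under this reading
    }
}

Other readings of the notation are possible; the sketch fixes only one of them for illustration.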

CHAPTER 2: PREREQUISITE KNOWLEDGE - Part I: KDD

Knowledge Discovery first embraced multimedia and web databases in 1990 with data warehousing, subject-oriented databases for decision support, OLAP (on-line analytical processing) and verification of hypothetical patterns. Generally, multimedia in KDD refers to information systems that allow for the creation, processing, storage, management, retrieval, transfer and presentation of multimedia information. Some examples are interactive video games, Music Information Retrieval, Google Maps, iPhone/iPod technology, video conferencing, news/movies-on-demand, games-on-demand, tele-commuting and multimedia e-mail. There are four types of multimedia systems: hypermedia (hypermedia represents an extension and evolution of the concept of hypertext), multimedia databases, multimedia e-mail, and virtual reality systems. Hypermedia is used as an extension of the term hypertext, where graphics, audio, video and hyperlinks intertwine to create a generally non-linear medium of information. Multimedia database systems are analogous to textual and numeric database systems and often perform cross-media functions in a manner that reduces redundant data storage, permits different views of data, and provides secure access to data. Multimedia database systems, such as the one in this thesis, are organized into hierarchical classes based on their common characteristics, where objects, such as our sound objects, may be composed of many different components. Intelligent agent systems utilize artificial intelligence techniques using intelligent agents that have limited but well-defined responsibilities, such as screening electronic mail. In multimedia databases, intelligent agents manage and access personal databases, analyze retrieved information and help users create new intellectual works from retrieved and

original information. Knowledge Discovery in Databases (KDD) and Data Mining understand the application domain and have the ability to extract the target dataset and to data mine the database in its raw form. The components of KDD in multimedia comprise 1) a model for classification, regression, clustering and rule generation, 2) a preference criterion and 3) a search algorithm.

2.1 Constructing Sound Objects

Constructing sound objects for use in a database invoking both temporal and non-temporal features is not a trivial matter. Through trial and error a methodology evolved, and to illustrate it in this thesis we use the MUMS Volume 6: Latin Grooves 1: Solo Instruments CD.

Figure 2.1: To illustrate the construction of the sound objects we use the MUMS Volume 6: Latin Grooves 1: Solo Instruments CD, which contains 79 wav tracks. We divide marimba: single stroke and crescendo roll into marimba: single stroke and marimba: crescendo roll.

STEP 01: Place the MUMS CD into the drive and view the contents. As one can see in Figure 2.3, the Latin Grooves 1: Solo Instruments CD contains 79 wav tracks. At issue is the fact that the .wav format is not conducive to MPEG-7 descriptors in the MIRAI database, because one needs to convert the file. Furthermore, the .wav file contains many notes. Again, this is not acceptable for the MIRAI database because, with

both temporal and non-temporal features, the database needs to have single sounds of single instruments. Track 01, not seen in Figure 2.1, is named in the catalogue as: marimba: single stroke and crescendo roll (2 entries per index). The design of the MIRAI database requires that each folder contain only one set of the sound implementations possible from an instrument. For example, as seen in Figure 2.2, for the violin sounds we have separated bowed, natural harmonics, muted vibrato, martele, pizzicato and bowed vibrato, because the aforementioned form distinct classifications of a violin's potential scale.

Figure 2.2: Marimba: single stroke and crescendo roll (2 entries per index) needs to be divided into marimba: single stroke and marimba: crescendo roll.

Figure 2.3: For the violin sounds we have separated bowed, natural harmonics, muted vibrato, martele, pizzicato and bowed vibrato, because the aforementioned form distinct classifications of a violin's potential scale.

Accordingly, marimba: single stroke and crescendo roll (2 entries per index) needs to be divided into marimba:

single stroke and marimba: crescendo roll (2 entries per index) later in Step 3. Automation of this step is moot, as only a human can determine when the MUMS, or any other database, needs to be divided into one, or up to seven, folders, as shown in the case of the violin.

STEP 02: Open up the folder where the database is being stored. In our case, it is under Dr. Ras's folder named ras. This database is online and alive, hence making it possible to link to it from the internet. Looking at the contents of T:\COIT\MYDEPT\cs\ras\MUMS, one can see there is no marimba folder.

STEP 03: Create a set of marimba folders, single stroke and crescendo roll, in the ras database, as seen in Figure 2.3.

STEP 04: Create a set of marimba folders in the MUMS temporary folder for the purpose of future automation.

Figure 2.4: After opening the two newly duplicated and created marimba folders one can see the distinctive waveforms of the single stroke versus the crescendo rolls. We then delete all the rolls in the single stroke folder and delete all single strokes in the crescendo folder.

STEP 05: Open Sound Forge 8.0 and open up the two newly duplicated and created marimba folders. Clearly one can see in Figure 2.4 the distinctive waveforms of the single stroke versus the crescendo rolls. We then delete all the rolls in the single stroke folder and delete all single strokes in the crescendo folder. One could make a C program to distinguish between the

abruptness of the single stroke compared to the Gaussian-like attack of the crescendos, but all the time spent making such a program would be wasted when we went to another instrument, such as bowed versus martele on a violin.

STEP 06: Check to see that the file separations are properly executed and then save both files. We use the following windows script to automate 14 procedures, as described herein for STEPS 07 through 09:

#z::
Send, {CTRLDOWN}x{CTRLUP}{ALTDOWN}f{ALTUP}n{ENTER}
Sleep, 100
Send, {CTRLDOWN}v{CTRLUP}
Sleep, 100
Sleep, 100
Send, {ALTDOWN}f{ALTUP}a
WinWait, Save As,
IfWinNotActive, Save As,, WinActivate, Save As,
WinWaitActive, Save As,
Send, {TAB}{TAB}{TAB}{TAB}{TAB}{TAB}{TAB}{TAB}{TAB}{TAB}{DOWN}{UP}

It should be noted at this point, however, that even though these steps are automated, there are, as will be demonstrated, certain stages of the aforementioned automation that require human intervention, checking and, lastly, decision making. Consider STEP 07, the first stage of automation. The start point of the wave is not the first instance where there is a variance above a certain threshold from no sound to sound. Rather, it requires a human ear to distinguish noise, or the dirty sound of the person playing the instrument as he/she shuffles their feet, lets the instrument touch their clothing or allows the bow to accidentally touch the violin. Of course a C program could easily be made to distinguish no sound from sound, but a human ear is needed to distinguish noise from viable database sound. Automation of this step is already implemented after a human distinguishes noise from sound. STEP 08

AUTOMATED: Creating a new window: already automated. STEP 09 AUTOMATED: Naming the selected and extracted sound in the new window: this step takes the longest time; however, it is the most important step, naming the tag for the sound: A is the note, 2 is the octave. Right here, one could ask: can't we scan in the MUMS catalogue, run it through an array and have it parsed onto the labeling? Yes, this is indeed a good idea; however, the MUMS database and the Ogihara database have incorrectly labeled many of the pitches. These are due to typos, or to the cataloguers simply not being musicians. In short, I listen to the sound and compare it to the pitch on my keyboard. I have found seven mistakes so far. A computer would not have found these, and they would wreak havoc in our system if not spotted. Here I check the A2 as shown in Figure 2.5. Once it is good, I hit CTRL-V and paste the marimba crescendo M tag.

Figure 2.5: Naming the tag for the sound: A is the note, 2 is the octave. Listen to the sound and compare it to the pitch on the keyboard. A computer would not have found these mistakes, and they would wreak havoc in our system if not checked.

STEP 10 AUTOMATED: Looking at the window T:\COIT\MYDEPT\cs\ras\MUMS, one sees we have one file, A2 marimba crescendo M.au, successfully inserted into Dr. Ras's database. STEP 11: The above steps created one of the seven files in marimba: crescendo roll. As seen in Figure 2.6, one needs to implement the batch file to continue extracting the other six files in marimba: crescendo roll; this batch file is automated as much as humanly possible. STEP 12: Continuing, as another observance of human intervention, sometimes MUMS will name the note C with the incorrect octave. In this case it was 3, but sometimes MUMS incorrectly keeps the

previous octave. Only by physically playing the note on the keyboard, listening to it and then checking it can our database have the integrity we demand; see Figure 2.5.

Figure 2.6: Steps one through ten create one of the seven files in marimba: crescendo roll. As seen, one needs to implement the batch file to continue extracting the other six files in marimba: crescendo roll.

STEP 13: All seven files have now been converted. STEP 14: Checking them in Sound Forge, we make sure the beginnings and endings are good. Note this is only the seven for crescendo, as the single strokes still need to be ripped. STEP 15: As one can see in Figure 2.7, Sound Forge includes the SFK (Sound Forge Peak Data) files and the SFL files (the cue points) in a separate file format but as files in the database. These will not be used by the database. Note there is no SFK or SFL on the single stroke, because those have not been ripped at this point. As can be seen in Figures 2.7 and 2.9, the respective files need to be transferred over to their folders. STEP 16: Note these are only seven of the 56 sounds for the marimba; times two (crescendo and single stroke) this makes 112 sounds for the marimba, so far we have shown how to do seven of the 112. STEP 17: The remaining (8 x 2 = 16) x 7 rips to be performed for the marimba

Figure 2.7: Sound Forge includes the SFK (Sound Forge Peak Data) files and the SFL files (the cue points) in a separate file format but as files in the database. These will not be used by the database. Note there is no SFK or SFL on the single stroke, because those have not been ripped at this point.

Figure 2.8: The respective files need to be transferred over to their correct folders for database retrieval.

Figure 2.9: The respective files need to be transferred over to their correct folders for database retrieval.
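The tagging convention used in the steps above (note, octave, instrument, articulation, as in A2 marimba crescendo M.au) can be sketched as a small helper. The Java below is purely illustrative: the format follows the example file name, but the exact separators and the meaning of the trailing "M" are our assumptions, and it does not replace the by-ear pitch check described in Steps 09 and 12.

// Illustrative sketch of the sound-object naming convention; the format
// string is an assumption based on the example "A2 marimba crescendo M.au".
import java.nio.file.Path;
import java.nio.file.Paths;

public class SoundObjectName {
    // note: pitch class, e.g. "A" or "C#"; octave: verified by ear against a keyboard.
    public static String tag(String note, int octave, String instrument, String articulation) {
        return String.format("%s%d %s %s M.au", note, octave, instrument, articulation);
    }

    public static void main(String[] args) {
        String fileName = tag("A", 2, "marimba", "crescendo");
        // Each articulation gets its own folder, e.g. .../MUMS/marimba crescendo roll/
        Path target = Paths.get("MUMS", "marimba crescendo roll", fileName);
        System.out.println(target);  // MUMS/marimba crescendo roll/A2 marimba crescendo M.au
    }
}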

CHAPTER 3: PREREQUISITE KNOWLEDGE - Part II: Music

One may skip this section if one is a musician. This thesis is not about music; rather, it is about novel advancements in KDD and MIR. However, in order for one to appreciate the music and KDD connection this thesis proposes, it is necessary to understand certain properties of music's physics, waveforms and theory. Following is a precise description of only the musical knowledge necessary for the understanding of this thesis. Furthermore, the headings of the music properties described herein are structured in a manner whereby one can easily identify which sections to skip if one already knows the subject matter.

3.1 INTRODUCTION TO MUSIC THEORY AND KDD

Musical Notes

Humans characterize sound waves by three parameters: pitch, loudness and quality. Loudness is measured on a logarithmic scale (decibels), defined as ten times the base-10 logarithm of the ratio of a sound's intensity to a reference intensity. Pitch is the frequency of musical notes arranged on a musical scale. Western culture's equitonic scale consists of octaves, each containing 8 whole notes, A, B, C, D, E, F, G and A, which comprise the white notes on the piano. The ratio from one of these keys to the next is the same for each key, and the scale is centered on the Middle C note at 264 hertz. This was seen in the famous Sound of Music when Julie Andrews teaches the fundamentals of music, the white notes, to the children in the form of a song: doh - re - mi - fa - sol - la - ti - doh [Richard Rodgers, 1959]. When one adds the black notes of the piano into the equation we have Western music's Chromatic scale: C, C#, D, D#, E, F, F#, G, G#, A, A#, B. Western scales, including the famous

Blues scale, are all subsets of the equitonic and chromatic scales [Kim et al., 2000]. In terms of frequency, the pitch interval between neighboring notes is called a half-step, which is a 6% increase in frequency. Two half-steps make a whole-step, which accordingly is a 12% increase in frequency. The black notes on the piano keyboard represent half-steps and can be referred to as being a half-step up from, or a half-step down from, a particular white note: C [C#/Db] D [D#/Eb] E F [F#/Gb] G [G#/Ab] A [A#/Bb] B. For example, C# is pronounced "C-sharp" and means the note that is a half-step higher than C. The same note may also be referred to as Db, pronounced "D-flat", the note that is a half-step lower than D. See Figure 3.1.

Figure 3.1: The piano keyboard arranged in a repeating pattern of white and black keys. Each repeat of the pattern is called an OCTAVE. Each octave has one key corresponding to each of the 12 note names. The normal alphabet letters are the white keys; the sharps/flats are the black keys.

The Perfect Fifth

Through trial and error, Western music composers have found guaranteed success in conveying happiness by using certain combinations of notes in the chromatic scale, namely what is called the perfect fifth. For generation upon generation, mothers have found that nursery rhymes comprising perfect fifths, such as Alouette, Rock-a-Bye Baby and Twinkle Twinkle Little Star, work best with their babies. Before going to perfect fifths, let's make sure that one understands that the chromatic scale goes up

one half-step in pitch when one moves one key to the right, which is typically a key of a different color in most cases, but sometimes one goes from a white note to another white note. This is the chromatic scale, which comprises the series of said half-step intervals commonly known as doh-di-re-ri-mi-fa-fi-sol-si-la-li-ti-doh. See Figure 3.2. A perfect fifth comprises two notes that are a fifth apart. They sound very good together, blend well, and, as mentioned, Western music has latched onto them as feel-good notes. For example, when one plays a C note and then plays the next G note to the right, it sounds beautiful and familiar. Yes, that's Twinkle Twinkle Little Star. One can start a perfect fifth on any note name; the other note is actually seven half-steps to the right. Following is the circle of fifths, which repeats itself every twelve sequences: C-G, G-D, D-A, A-E, E-B, B-F#, F#-C#, C#-G#, G#-D#, D#-A#/Bb, Bb-F, F-C. See Figure 3.2.

Figure 3.2: Two examples of the Chromatic Scale, one starting in A and the other in C. Also shown is the circle of fifths, which repeats itself every twelve sequences: C-G, G-D, D-A, A-E, E-B, B-F#, F#-C#, C#-G#, G#-D#, D#-A#/Bb, Bb-F, F-C. Most Western children's songs and happy music are founded on this essential bedrock of feel-good music.

Major Third

A second significant interval that is grounded in Western music is the Major Third, comprising two notes which are a third apart. Like the Perfect Fifth, this combination sounds pleasant to Western listeners and is a staple pattern found in happy Western-culture songwriting. For example, if one plays a C note and then plays the next E

note to the right, it sounds to the vast majority of Westerners like a beautiful experience. Like the Perfect Fifth, one can start a major third on any note so long as the next note is four half-steps to the right: C-E, D-F#, E-G#, F-A, G-B, A-C#, B-D#. See Figure 3.3.

Figure 3.3: Like the Perfect Fifth, one can start a major third on any note so long as the next note is four half-steps to the right, as illustrated herein with randomly chosen Major Thirds such as C-E, D-F#, F-A, A-C#, F#-A#, C#-F, and E-G#.

3.2 KDD and The Physics of Music

This thesis asserts that for a machine to recognize and retrieve emotions in databases, it must understand the physics, Fourier transforms and harmonics of waveforms. To understand harmonics, consider a simple sinusoidal wave having the form

y = A sin(ωt)    (3.1)

where A is the amplitude. The wave is also periodic, meaning that a wave having a frequency, f, repeats itself with a period, T:

f = frequency = 1/T    (3.2)

T = period = 1/f    (3.3)

When sound emanates from an instrument there is a fundamental frequency accompanied by higher-frequency components called overtones. Overtones that are integral multiples of the fundamental are called harmonics. We can express a waveform F(t) simply as the series addition of harmonics:

F(t) = A1 sin(2πft + φ1) + A2 sin(2π·2ft + φ2) + A3 sin(2π·3ft + φ3) + ...

where An and φn are the amplitude and phase of the n-th harmonic. Some of these harmonics, when paired with others, give Westerners a sense of good-sounding notes. These musical intervals include:

Table 3.1: Intervals
    Unison   1:1        Major Third  5:4
    Octave   2:1        Minor Third  6:5
    Fifth    3:2        Major Sixth  5:3
    Fourth   4:3        Minor Sixth  8:5

Accordingly, there are groups of waveforms consisting of three or more notes that also elicit nice or agreeable responses from Westerners, and we call these chords. Major chords have three notes in a frequency ratio of 4:5:6, and the frequencies of the major diatonic scale are likewise simple ratios of f, the frequency of the root or tonic. We acknowledge that a major chord has a natural and pleasant sound. Mathematically, this makes sense in that there is perfect integration of each harmonic's patterns. For example, in A Major, the seventh harmonic is between F# and G, and the eighth is A again, three octaves up. Contemporary notions describe the aforementioned as Happy. We believe this is not completely correct. Contemporary notions hold that the sad notes of the blues are simply realized by a slight shift in the harmonics: for example, moving from the happy harmonics of a C Major chord (C E G) to the C Minor chord. Contemporary thought believes that the induction of a darker feel is based completely on the minored shift of the E to a D#, as this results in a missing harmonic, thus creating a darker feel. We propose that this is not completely correct in that it is, as

will be demonstrated, too simplistic and overly broad:

Table 3.2: Happy harmonics of a C Major chord: C, E, G.

Table 3.3: Darker harmonics of a C Minor chord: C, D#, G.

3.3 FOURIER AND THE HUMAN EAR

The human ear takes the complicated sound waves, as illustrated above in the shift from a C Major to a C Minor, and resolves the relative phases of their overtones into a perception of the timbre of the note. Fourier proved that any vibration can be represented mathematically. Looking at a sawtooth wave: Fourier takes the fundamental and the first harmonic and adds them together.

Figure 3.4: Fourier: Fundamental.

In the next step Fourier takes this sum, as shown above, and adds it to the next overtone:

47 33 Figure 3.5: Fourier: Next overtone Figure 3.6: Fourier: 9th Overtone

48 Fourier continues the aforementioned until one gets to the 9th overtone as it is close to the desired saw tooth shape. The dichotomy of researching for a mathematical validation to emotions invoked from sound waves entering the ear is that subjects who know nothing of the aforementioned can be experts. In fact, every human is an expert on what they personally perceive emotionally after absorbing sound waves into their eardrum. It is hear that our research, stepped back and viewed the forest for the trees. The small study was conceived when during previous research, our team pondered how long each sample should be. It became evident that on samples where chord changes shifted rapidly, the team sometimes opted for a smaller sample times. Conversely, on longer songs, where the chord changes were long and drawn out, particularly on Mozart s Requiem, the team agreed on a much longer sample rate. Why? We decided to flip former methodologies of playing samples to subjects and blindly asking them what emotion they felt and decided to rather: 1. Induce or suggest what the emotion for a sample should be based upon what others thought. Where other was contemporary theories such as Huron; 2. Induce or suggest what the emotion for a sample should be based upon what others thought. Where other was the inverse of contemporary theories such as Huron; and, 3. Measure the results of the aforementioned in terms of four degrees: (a) Distance from a chord change. (b) Distance from a tonal change. (c) Acceptance of contemporary notions. (d) Acceptance of the inverse.

49 Music Theory in a Nutshell Music Waveforms Notes in scales make up melody, defined as an auditory object that emerges from a series of transformations along six dimensions: pitch, tempo, timbre, loudness, spatial location, and reverberant environment [Thompson and Parncutt, 1997]. An octave is a doubling of frequencies, therefore its interval is the 12th root of 2 ( ). Scales and chords utilize harmonics and overtones. In order to understand the blues scale one needs to also appreciate scales and harmonics. A harmonic exists when one multiplies a notes frequency by a whole number. Harmonics and overtones are the same thing labeled differently. For example, the first overtone of a frequency equals the second harmonic. This paper will refer to harmonics for the sake of consistency. Note that the second harmonic is a note with twice the frequency, or, commonly known as an octave. The most common scales is the major scale and its seven modes which are the same scale but starting with a root note on another note. It is said that each of the seven modes has a distinct emotion to it. Westerners find that the most familiar sonority is the major and minor triads which measures the tonal profiles of sonorities, including perfect fifth and major thirds also known as simultaneous dyads. The major or Ionian is for happy music. The minor, or Aeolian and Dorian scales that have roots on the 2nd and 6th notes of the major are for sad or dark music. The Blues scale, which is based on the Pentatonic is found using the same method as the Aeolian and Dorian scales except it only incorporates four notes after the root. However, in modern music we say the Pentatonic has 6 notes because we include the playing of the root, one octave higher as being part of the scale. The African slaves began copying and then either mistakenly or purposefully flattening its III and VI notes over this pentatonic scale. Something, yet fully understood by science, inherent in the aforementioned sound formation evokes a human to feel sad.

50 The Blues Scale The Blues is a music genre most likely to produce a specific emotion to humans assimilated with western culture. Conversely, the musical genres of Jazz, Classical, Country and Rock include musical structures and instances that can make some humans happy but others excited. Psychologists structure emotion in terms of core affect [Russell, 2003], emotion, mood, attitude [Russell et al., 1989], and temperament [Watson, 2000]. Certain instances make some humans agitated [Yang and Wonsook, 2004] while making others feel motivated [Clynes and Nettheim, 1982]. Indeed, Ledoux [Ledoux, 1992] discovered a bundle of neurons directly connecting the thalamus with the amygdala that initiate an emotional response before the cortical centers are even engaged. Certain instances make some humans mellow while making others excited and so on [Clynes, 1985]. However, the Blues is different. We choose to analyze the Blues scale because for the most part, it is generally accepted that it has a greater chance, over other scales, to induce a specific emotion. Webster s 1913 Dictionary defines the blues as 1) a type of folk song that originated among Black Americans at the beginning of the 20th century; has a melancholy sound from repeated use of blue notes; and 2) a state of depression; as, he had a bad case of the blues. Nevertheless, what is the mathematical explanation for this anomaly? In other words, what is happening in the mathematical interpretations of waveforms emitted from a blues scale that invokes human emotions of sadness or melancholy? Many have doubts this can be attained [Landry, 2004]. In the famed blues songwriter W. C. Handy s autobiography he references an itinerant player who played a blues song and used a knife as a guitar sometime around 1900 [Barr, ]. Before synthesizing the blues scale notes this paper acknowledges that Blues music, as a whole, contains three attributes: 1. a rhythm is closely associated to African rhythms [Aptheker, 1969]. 2. a pentatonic sounding music accentuating its flattened III and VII [Eck and Schmidhuber, 1

51 3. a call and response structure similar to European and English folk music using the same three chords over a diatonic scale. 37 Blues rhythms can easily be associated to African rhythms because it originated with the African slaves in North Mississippi Delta prior to the Civil War of the United States. There is also a myth, which this thesis contends is not too far from the truth: Blues folklore states that the slaves would hear the piano of their white owners playing and try to replicate it in their guitar playing field hollers, ballads and spiritual/church music. However, they either, 1. never quite got it right, or, 2. they purposefully, simply, preferred to flatten the III and VII because in their state of sadness and depression - it simply felt better.... Blues lyrics, are not part of the mathematical structure of notes that emit emotions, but suffice to say, it typically encompass misfortunes and trouble. However, taking these three points into consideration, we know that there is 1. relentless rhythm in the music that 2. repeats, with the use of 3. flattened III s and VII s, which taken as a whole fits perfectly with a sorrowful story and the forlornness of a lost soul many times over.... With the aforementioned in mind, we now look a little deeper into the musical structure. 3.4 Representing Scales Medical Science is recognizing that human being are in fact affected emotionally by music [Praat, 2004] [Praat et al., 1996]. For the most part, Music Information

52 Retrieval (MIR) research in musical scales is for the purpose of queering groups 38 of notes in melodies. Here, the pentatonic scale is often used as it only has five notes in the scale [Zobel and Uitdenbogerd, 2004]. Determine the musical scale for a set of notes may initiate new research in both queering music and in correlating sets of notes to emotional states of being. The basis of recognizing a musical scale is first recognizing the pitch of each note comprising the scale, including time-domain and frequency-domain methods; autocorrelation, cepstral coefficients, zero-crossings, Average Magnitude Difference Function and so on [Kaminskyj, 2000a], [Czyzewski and Szczerba, 2002],[Kostek and Czyzewski, 2001a], [Wieczorkowska et al., 2003b], [Klapuri, ], [Kostek and Czyzewski, 2001b],[Sylvain, 2001]. Finding the fundamental frequency of a sound wave enables one to determine the pitch of the sound wave. The middle A above middle C has a frequency of 440 Hz in standard tuning. The frequency is doubled to make the same note an octave higher, and halved to make the same not an octave lower. The distance of all the other frequencies in temporary tuning is calculated using semitones. The frequencies within an octave, starting from a given note and going up in the frequency scale, can be calculated using coefficients according to the following formula: f k = f 1 2 k/12 (3.4) where k - number of semitones separating f k and f 1. Finding the fundamental frequency of a series of notes determines the relationship and the musical scales. Furthermore, we believe that an effective retrieval of pitches and their correlating scales will, as research progresses, lead to matching sets of sound waves to emotions. If a system knows, for example, that it is operating in the key of C and it encounters a pitch of Hz, then this C note is called the fundamental tone. The fourth semitone (i.e. distance of 4 semitones up from C), which is E, and the seventh semitone, which is G, together produce the tonic chord forthekeyofc.if

53 we analyze harmonic spectrum for the C note, then we have the following components in the spectrum: Hz - note C, Hz - note C, but an octave up, Hz - note G, Hz - note C, Hz - note E (but not perfectly in tune), Hz - note G, Hz - B flat (not perfectly in tune), Hz - note C, Hz - note D (not perfectly in tune), Hz - note E (not perfectly in tune) Higher harmonics form the remaining notes of the tempered scale for the key of C (although not always in perfect tune, according to contemporary tuning system, described by the formula given above) [Sevgen, 1983]. The vast majority of scales in contemporary Western music consist of 5 to 7 different notes (pitches). To calculate the number of possible scales for all 5 to 7 note scales we assert that the starting and ending notes are fixed and that there are twelve notes in an octave, while this leaves 10 notes between the starting and ending notes. Also, we consider each individual note by moving from the lowest to the highest note. We cannot repeat a note and this all leaves only one possible order that any scale can be in. The semitones are N and the tonic t 1 form the remaining notes t2,..., t M in the scale and are distributed over the remaining N 1 points. Scales can be represented

54 40 Table 3.4: CMajor:The header represents the Tonic at 1 and the following notes of a scale. For a Major scale in the key of C, the tonic is C, the 2nd note, 2m is D, the 3rd note 3m is and E and so forth C D E F G A B Table 3.5: Generic Major: The header represents the Tonic at 1 and the following notes of a scale. For a Generic Major, meaning it could be in any key, the distance from the tonic to itself is zero, the distance from the tonic to the second note is 2 semitones, the second note to the third note is 2 semitones, the third note to the fourth note is 1 and so forth using the so called the Spiral Array, where pitch states are associated by coordinates downward along an ascending spiral [Chew, 2002b]. For each scale there exists a set of chords and special musical notes we call roots, thirds and fifths. These special notes wave forms interact in a predictable and harmonious manner with the root note we call harmony. As illustrated in Table 7.10, the diatonic progression contains three major chords called 1, 4, and5 and three minor chords (2m, 3m, and6m). The distance from the tonic to itself will always be zero, so we eliminate the redundancy of stating that the distance from the Tonic to itself by starting with the distance from the tonic to the 2nd note in the scale. See Table 3.5: For purposes of this paper, we have selected some substantive scales and modes for analysis of the 12 different pitches in an octave, that may constitute scales [Sevgen, 2000]. Modes are similar to scales in that they also list the notes that are used in a piece of music. See Table 3.4. However, modes include notes that are used for endings and recitations. One will notice for example that most scales illustrated in Table 3.4 contain a distance of 2 from the root or tonic. Only the Spanish Scale and Phrygian Mode contain

55 Table 3.6: Scalar Charts Depicting Significant Musical Scales: Node T:3:1 - The Augmented Scale, Node T:3:2:1 - The Blues Major, Node 1:3:2:2 - The Minored Pentatonic and Blues Scale Major Scale Pentatonic Major Scale Blues Major Scale Melodic Minor Scale Minor Scale Lydian mode Mixolydian Mode Phrygian Mode Ionian Mode Dorian Mode Harmonic Minor Scale Augmented Scale Pentatonic Minor Scale Be-Bop Scale Blues Minor Scale Spanish Scale Jazz Minor Mode Hawaiian Scale a distance of 1 from the root or tonic. Lastly, only Blues Major Scale, Pentatonic Minor Scale, Blues Minor Scale and the Augmented Scale contain a distance of 3 from the root or tonic (see Fig. 3.7). To demonstrate the viability of linking two states, an example that falls within the domain of scales represented in Table 3.4 is shown in Fig This figure illustrates a scale-finding algorithm that when given a sequence of notes will determine the scale. This takes into account how, it will not know what key the music is in, neither, whether a particular set of notes are the first notes, the middle or the last notes in a piece of music. The implementation of the recognition of scales and nodes, represented graphically in Figure 3.7, is shown in the Algorithm below. The main problem is to find the order of notes given as an input to the algorithm that is finding the scale, i.e. find the first note (the tonic). Variables include:

56 42 Spanish Scale Phrygian Mode 1 Tonic 1 Hawaiian Scale Augmented Scale 2 1 Blues Major Scale 2 2 Dorian Mode Jazz Minor Mode Minor Scale Harmonic Minor Scale Mixolydian Mode Ionian Mode Major Scale Be-Bop Scale Lydian mode Melodic Minor Scale Pentatonic Minor Scale Blues Minor Scale Pentatonic Major Scale Figure 3.7: calar Tree Depicting Significant Musical Scales: t-1: denotes Spanish or Phrygian, t-2-1: denotes minors, t-2-2: denotes Major contemporary Western scales, t-3-2 denotes sad, blues scales 1. φ - Frequency Input of forthcoming musical signals. 2. β - Counter of sets of distinguishable frequencies. 3. α - Array cache holding different notes in the musical piece is. 4. ζ - List to sort. 5. ψ - Index List 6. ω - VarT Testing scale using a chosen note in the queue as the Tonic, The initialization sets the counters where Frequency Input of the musical signal forthcoming are φ, Counter, counting possible sets of distinguishable frequencies, between 5 and 13 different notes is β. The cache holding all different notes in the musical piece is α. In the sort phase up to 13 notes from the cache α are housed in the array β, where a simple sort program sorts the notes lowest to highest which contains two varying lists: The List to sort,ζ and the Index List, ψ. Assume nine notes are entered: EEFCEGGAG, in all equal notes are discarded we are left with

57 43 where the array β consist of notes EFCGA on the first two loops the order remains EFCGA, then CEFGA and finally ACEFG. Initially a system will consider whether the lowest note is the Tonic, in this case it is A. There are some instances where a scale could comprise tonics in two scales. For example the Minor pentatonic is shifted up four intervals from the major pentatonic. However, fundamental frequencies from bass lines in a polyphonic source will deduce the correct schema. In this case we will assume that a bass line is set to C making the system realize these notes are in the key of C, but in an unknown scale as a human we take the first two notes which yield a difference of three which could yield T:3:1 The Augmented Scale, T:3:2:1 The Blues Major Scale T:3:2:2. The Minored Pentatonic and Blues Scale. Working through this it arrives at T:3:2:1 The Blues Major Scale, however, with the notes ACEFG inside the cache, knowing it is in the scale of C, it transforms the order to a baseline of C making CEFGA Balanced Trees: A Structure for Musical Scales The tree structure that houses the final search for the scale needs to accommodate the fact that each node may have a variable number of keys and children. This rules out most types of binary trees except for hash trees and Rudolf Bayer s Balanced Trees, otherwise known as B-Trees. [Bayer and McCreight, 1972]. B-trees can contain a continually varying set of keys and children where each key is cached in non-decreasing order and has a child that is its root for the subtree containing all keys that are less than or equal to the key but greater than the preceding key. From the outset, the algorithm sets forth a minimum number of allowable children for each node (minimization factor). A node has an additional rightmost child that is the root for a subtree containing the keys greater than all the keys contained in the node - the beginning of new branches. Every node may have at most 2t - 1 keys (where t = minimum degree) or, equivalently, 2t children. B-trees were typically used as an ideal data structure for situations where all data cannot reside in primary storage

58 44 and accesses to secondary storage were either expensive or time consuming Balanced Scaler/Emotion Tree In our system, each branch will represent a set of scales, the notes of the scales, and the xyz position in the 3-D emotional domain. The point of the balanced tree is, for example, if the music has taken the search to a particular branch and dissonant notes, or unexpected notes appear, then the tree can balance itself into two new nodes while retaining the correlating xyz position on the emotional domain. In our case, the essential element is that the fuzziness of both MPEG-7 fundamental Frequency descriptors together with the fuzziness of placing an emotion into the soon to be explained 3-D emotional chart - lead to a tree that will always vary. This will be true temporal terms as it searches frequencies and accommodates new justifications for placing a particular emotion to a particular musical scale. 3.5 Neuroendocrinology Neuroendocrinologists measure data from neurotransmitters in laboratories and analyze resulting data in the Profile of Mood States (POMS) showing Depression/Dejection scores. Recently at UCLA, it has shown that depression scores decreased in the keyboard group not in the control group. It concluded that with decreasing depression people reported a brighter mood and; since depression is a major problem for older adults, these findings were especially uplifting [Kumar, ]. On the point of emotional analysis, the science contains a varied array of papers concerning emotions in music [Sevgen, 1983] [McClellan, 1966]. At one extreme, the renown Psychologist, Gerald G. Jampolsky, MD says there are only two primary emotions, love and fear, from which all other emotions are derived. Conversely, Robert C. Solomon, one of the distinguished psychologist in emotions has identified forty-two emotions including: Anger, Anxiety, Anguish, Jealousy, Pity, Pride, Regret, Remorse, Sadness, Shame, Spite, Dread, Duty, Embarrassment, Envy, Fear, Grief, Vanity, Self-Contempt, Self-

59 45 Hatred, Self-Pity, Self-Love, Joy, Self-Respect, Worship, Frustration, Guilt, Hate, Indifference, Indignation, Terror, Faith, Friendship, Hope, Innocence, Love, Respect, Contentment, and Rage [Jackson, 1998]. In 1991, Sloboda surveyed a group of 34 professional musicians to learn what physical responses the participants related to, these included; Shivers down the spine, Laughter, Lumps in the throat, Tears, Goose pimples, Racing heart, Yawing, and Pit of stomach sensation [Jackson, 1998]. We hold that Jampolsky s two dimensionality has validity in that it represents only the x-axis of a three dimensional emotional state. Furthermore, the authors propose that the more physical and testosterone-based emotions of Hevner s circle and Solomon s can be represented objectively via the y-axis and z-axis shown in Fig Scalar Emotions Before linking the partial scalar tree from Figure 3.7 to the three dimensional emotion space, this thesis illustrates the classification of the categories of emotions normally associated with the aforementioned scales. Accordingly, looking at the first node of Fig. 3.7, when traveling a distance of 3 from the Tonic to the possibilities include (in this thesis s limited schema, as shown in Figure 3.7) the Augmented scale, Blues Major Scale, Pentatonic Minor scale and Blues Minor Scale. Node T:3:1 The Augmented Scale The Augmented scale is typically a sadder scale. It is often used by Heavy Metal Bands such as Iron Maiden when they compose songs that invoke sadness. Broadway will augment the music piece when something sad occurs in the music. The Augmented scale is not a Whole Tone scale. As shown in Table 3.4, it contains a pattern of T In musical terms it contains the Tonic, the major 3rd and augmented 5th intervals, which is similar, but not quite, to a Whole Tone scale. What is striking is that when traveling from the Tonic to node T:3, no matter where one proceeds from this juncture, the scales stay sad.

60 46 Node T:3:2:1 The Blues Major Scale The Blues Major Scale is typically heard with its emphasis on the blues scale with a major 3rd, and being played over a moving rhythm. Hence the term Rhythm and Blues, swing or jump music. Black Musicians developed Rhythm and Blues in the 1940 s by combining blues and jazz. In the 1950 the Blues Major scale was seen in Rock and Roll. In the 1980 s, after the decline of disco Rhythm and Blues, Major Blues Scales scales are utilized in soul and funk influenced African-American pop music: the sadder the song, the more the major 3rd was emphasized. As shown in Table 3.4, it contains a pattern of T Node 1:3:2:2 The Minored Pentatonic and Blues Scale The pentatonic, five note scale, has for the most part arisen in folk music. The two dominant forms are the minor pentatonic and the major pentatonic. The minor pentatonic omits the 2nd and 6th degrees of the minor scale while the major pentatonic omits the 4th and 7th scale degrees of the major scale. The minor Pentatonic which is known for being darker is used primarily in rock and then Major Pentatonic is used primarily in lighter country, crossover country/pop or jazz - because the scale is sweeter, lighter more bright sounding [Wieczorkowska et al., 2005] Conclusion of Neuroendocrinology The possibilities for creating two state machines, one for emotional state in a 3 dimension spatial array and the other a Music Information Tree depicting the nodes of scales, are endless. The motivation behind this thesis is knowing that once we can accomplish the aforementioned, namely: Identifying a scale and correlating it to an emotional state, we will be able to reverse the procedure and connect the system to a human being attached to a neuroendocrinologists device revealing the patients state - herby we can theoretically direct a patient s emotional state from one undesired state to possibly another doctor prescribed new state. As illustrated in this thesis

61 47 the viability is evident, states of emotion and states of musical scales can indeed be represented. The algorithms and testing phase will dominate future work, however, this is an exciting new domain and point of view in Neuroendocrinology, Knowledge Discovery and Music Information Retrieval.

62 CHAPTER 4: PREREQUISITE KNOWLEDGE -Part III: Digital Music This section discloses the manner in which we have developed a methodology and database system that associates the Emotional 3D Domain with Music Instruments 4.1 Associating the Emotional 3D Domain with Music Instruments. To establish a methodology the authors decided that first, humans had to start associating sounds with emotions and secondly, that whatever humans associated sounds, to ensure consistency, they would have to associate all of the 4,400 sounds in the database. This takes time, so a methodology was modeled using 4 humans consisting of 3 musically trained musicians and a part time drummer who has no music ear but listens to a lot of music. Sounds of instruments were played and the persons voted in terms of the 33 emotions listed in Table1.3 together with one of three strength factors with 3 being strong, very sure of the emotion, 2 being neutral and 1 being weak, not too sure of the emotion. Looking at Table1.3 which is a sample of 36 of the 4,000 instrument sounds one will note the first attribute consists of the name of the sound. The first number consists of the octave, the next letter is the note, then the name of the instrument and the last letter denotes the original database picked up from. Each listener was required to pick an emotion from the list of 33 emotions and then associate how strongly they felt. For example, in tuple 1, 4F French Horn, the first listener named Subject1 described that F note on the 4th octave being played on a French horn as Very Melancholy, thus giving it Melancholy3. Jumping over to Attribute nsubject1, is the systems Emotional 3D Domain listed in Table1.3 which is 2,-1,1. However he rated this emotion strongly, with a 3, therefore attribute nsubject1 describes this choice as 3, 1, 13. Accordingly, listener 2 described the same sound

63 49 Calculating Cluster Value ([( 3, 1, 1)*3]+[(2, 1, )*1]+[(2, 2, 2)*2]+[(,, 3)*3])/9 = ([ 9, 3, 3]+[2, 1, ]+[4, 4, 4]+[,, 9])/9 = ( 3, 2, 16)/9 = 0.33, 0.22, 1.77 = 0, 0, 2. This is calculated for all 4,400 sounds. Table 4.1: Normalizing 4 representations of a sound with weighted states taken into consideration for clustering purposes. as moody weak which generated a nsubject2 as 2, 1, 1. Listener 3 described the same sound as Comforting neither weak nor strong which generated a nsubject3 as 2, 2, 2 2. Finally, listener 4 described the same sound as Brave strong which generated a nsubject4 as,, 3 3. Calculating the value of this notes is a simple average of the summation of the locations in the Emotional 3D Domain. 4.2 Initial Mining of Rules for Music Instruments The objective of this thesis is to lay out a foundation that in conducive to mining rules for emotions associated with instrument sounds. The approach the authors have is that emotions can now be empirically measured, associated and mined in context to instruments and all their MPEG-7 attributes. The distribution along the axis seems to be delicate within the 0 to 1 context and then stretched out over the 1-5 context. Further research into how the summation of more inputs from more users can be evenly distributed without simply drawing back to zero.

64 50 3-D Emotion Domain Name Subject1 Subject2 Subject3 Subject4 nsubject1 nsubject2 nsubject3 nsubject3 Total 4F frenchhornm Melancholy3 Moody1 Comfort2 Brave CfrenchHornmutedM Melancholy3 Moody2 Confidence2 Brave DtubularbellsM01 Melancholy3 Whimsfull3 Angst1 Delight AsteeldrumssinglestrokeM04 Melancholy3 Gladness3 Irritability3 Delight BsteeldrumssinglestrokeM01 Melancholy3 Gladness3 Irritability3 Delight A piano9ftsteinwaypluckedm01 Melancholy3 Amusement2 Cautiousness1 Delight Apiano9ftsteinwaypluckedM01 Melancholy3 Amusement2 Cautiousness1 Delight F vibraphonehardmalletm01 Melancholy2 Free3 Irritability3 Delight BviolinmutedvibratoM Depressive1 Comfort1 Determination3 Ecstacy GviolinmutedvibratoM F eelgood1 Comfort1 Determination3 Ecstacy F violinmutedvibratom Melancholy3 Comfort1 Determination3 Ecstacy C violinmartelem F eelgood2 Comfort1 Gladness1 Ecstacy EviolinensemblebowedM F eelgood2 Comfort1 Happiness1 Ecstacy F glockenspielbrassbeaterm01 F eelgood2 Elation1 Irritability3 Ecstacy CviolinpizzicatoM F eelgood1 Comfort1 P aranoia1 Ecstacy F oboem F eelgood2 Apprehension1 Comfort2 Ecstacy G piccolom F eelgood2 Zest2 Comfort2 Ecstacy CsteeldrumssinglestrokeM01 F eelgood2 Gladness3 Irritability3 Ecstacy G piccoloflutterm F eelgood2 Zest2 Apprehension3 Ecstacy F piccolom F eelgood3 Zest3 Comfort2 Ecstacy C violinartificialharmonicsm F eelgood3 Comfort1 Glee1 Ecstacy DpianohamburgsteinwayloudM01 F eelgood3 Gladness2 Irritability3 Ecstacy CmarimbacrescendoM Depressive1 Kindness1 Angst1 Free GbassclarinetM Depressive1 Somber2 Acceptance3 Gratitude C bassfluteflutterm Depressive1 P atience3 Apprehension3 Gratitude G piano9ftsteinwaypluckedm01 Depressive1 Amusement1 Cautiousness1 Gratitude CsaxophonebassM01 Depressive2 Humility2 Acceptance3 Gratitude A bassoontm Depressive2 Mellow3 Comfort2 Gratitude AcellobowedvibratoM Depressive2 Somber3 Determination3 Gratitude C cellomutedvibratom Depressive1 Somber2 Melancholy Gratitude GviolapizzicatoM F eelgood1 Desire2 P aranoia1 Cautiousness C cellopizzicatom F eelgood1 Melancholy1 P aranoia1 Cautiousness C saxophonealtom01 F eelgood1 Comfort2 Acceptance1 Cautiousness EctrumpetharmonStemOutM F eelgood1 Brave3 Confident1 Kindness C fluteflutterm Melancholy3 Joy1 Apprehension3 Whimsfull Table 4.2: 3-D Emotion Domain: Representation of 31 emotions with their corresponding placement in a 3-d domain

65 CHAPTER 5: PREREQUISITE KNOWLEDGE - Part IV: MPEG-7 and Timbre Descriptors 5.1 Timbre Descriptors We use Timbre descriptors to distinguish beween the timbre of variuos instruments. For example, in our 2005 paper, Extracting Emotions from Music Data. [Wieczorkowska et al., 2004a] we used a long analyzing frame (32768 samples taken from the left channel of stereo recording, for Hz sampling frequency and 16-bit resolution), in order obtain more precise spectral bins, and to describe longer time fragment. We applied a Hanning window to the analyzed frame and spectral components up to 12 khz were taken into account. The following set of descriptors was calculated [Wieczorkowska et al., 2003a]: Frequency: dominating fundamental frequency of the sound Level: maximal level of sound in the analyzed frame T ristimulus1, 2, 3: Tristimulus parameters calculated for Frequency,givenby [Pollard and Jansson, 1982]: T ristimulus1 = A 2 1 N n=1 A2 n (5.1) T ristimulus2 = n=2,3,4 A2 n N n=1 A2 n (5.2) T ristimulus3 = N n=5 A2 n N n=1 A2 n (5.3) where A n denotes the amplitude of the n th harmonic, N is the number of harmonics available in spectrum, M = N/2 and L = N/2+1

66 52 EvenHarm and OddHarm: Contents of even and odd harmonics in the spectrum, defined as M EvenHarm = k=1 A2 2k N n=1 A2 n (5.4) OddHarm = L k=2 A2 2k 1 N n=1 A2 n (5.5) Brightness: brightness of sound - gravity center of the spectrum, defined as Brightness = N n=1 na n N n=1 A n (5.6) Irregularity: irregularity of spectrum, defined as [Fujinaga and McMillan, 2000], [B. Kostek, 1997] ( N 1 Irregularity =log 20 log k=2 ) A k 3 Ak 1 A k A k+1 (5.7) Frequency1, Ratio1,..., 9: for these parameters, 10 most prominent peaks in the spectrum are found. The lowest frequency within this set is chosen as Frequency1, and proportions of other frequencies to the lowest one are denoted as Ratio1,..., 9 Amplitude1, Ratio1,..., 9: the amplitude of Frequency1 in decibel scale, and differences in decibels between peaks corresponding to Ratio1,..., 9 and Amplitude1. These parameters describe relative strength of the notes in the music chord. Since the emotions in music also depend on the evolution of sound, it is recommended to observe changes of descriptor values in time, especially with respect to music chords, roughly represented by parameters Frequency1, Ratio1,..., 9; we plan such extension of our feature set in further experiments.

67 CHAPTER 6: MINING EMOTIONS IN MIR In the continuing goal of codifying the classification of musical sounds and extracting rules for data mining, we present the following methodology of categorization, based on numerical parameters. The motivation for this thesis is based upon the fallibility of Hornbostel and Sachs generic classification scheme, used in Music Information Retrieval for instruments. In eliminating the redundancy and discrepancies of Hornbostel and Sachs classification of musical sounds we present a procedure that draws categorization from numerical attributes, describing both time domain and spectrum of sound. Rather than using classification based directly on Hornbostel and Sachs scheme, we rely on the empirical data describing the log attack, sustainability and harmonicity. We propose a categorization system based upon the empirical musical parameters and then incorporating the resultant structure for classification rules.[lewis and Raś, 2007b] 6.1 Instrument Classification Information retrieval of musical instruments and their sounds has invoked a need to constructive cataloguing conventions with specialized vocabularies and other encoding schemes. For example the Library of Congress subject headings [Brenne, 2004] and the German Schlagwortnormdatei Decimal Classification both use the Dewey classification system [Doerr, 2001], [patel, 2005]. In 1914 Hornbostel-Sachs devised a classification system, based on the Dewey decimal classification which essentially classified all instruments into strings, wind and percussion. Later it went further and broke instruments into four categories: 1.1 Idiophones, where sound is produced by vibration of the body of the

68 54 instrument 2.2 Membranophones, where sound produced by the vibration of a membrane 3.3 Chordophones, where sound is produced by the vibration of strings 4.4 Aerophones, where sound is produced by vibrating air. For purposes of music information retrieval, the Hornbostel-Sachs cataloguing convention is problematic, since it contains exceptions, i.e. instruments that could fall into a few categories. This convention is based on what element vibrates to produce sound (air, string, membrane, or elastic solid body), and playing method, shape, relationship of parts of the instrument and so on. Since this classification follows a humanistic conventions, it makes it incompatible for a knowledge discovery discourse. For example, a piano emits sound when the hammer strikes strings. For many musicians, especially playing jazz, the piano is considered percussive, yet its the string that emits the sound vibrations, so it is classifies as a chordophone, according to Sachs and Hornbostel scheme. Also, the tamborine comprises a membrane and bells making it both an membranophone and an idiophone. Considering this, our thesis presents a basis for an empirical music instrument classification system conducive for music information retrieval, specifically for automatic indexing of music instruments A three-level empirical tree We focus on three properties of sound waves that can be calculated for any sound and can differentiate. They are: log-attack, harmonicity and sustainability. The first two properties are part of the set of descriptors for audio content description provided in the MPEG-7 standard and have aided us in musical instrument timbre description, audio signature and sound description [Wieczorkowska et al., 2004b]. The third one is based on observations of sound envelopes for singular sound of various instruments and for various playing method, i.e. articulation.

69 55 Signal Envelope (t) t T0 T1 Figure 6.1: Log-attack time T0 can be estimated as the time the signal envelope exceeds.02 of its maximum value. T1 can be estimated, simply, as the time the signal envelope reaches its maximum value LogAttackTime (LAT) The motivation for using the MPEG-7 temporal descriptor, LogAttackTime (LAT ), is because segments containing short LAT periods cut generic percussive (and also sounds of plucked or hammered string) and harmonic (sustained) signals into two separate groups [Gomez et al., 2003], [J.M.Martinez and R. Koenen, 2001]. The attack of a sound is the first part of a sound, before a real note develops where the LAT is the logarithm of the time duration between the point where the signal starts to the point it reaches its stable part.[peeters et al., 2000] The range of the LAT is 1 defined as log 10 ( ) and is determined by the length of the signal. Struck samplingrate instruments, such a most percussive instruments have a short LAT whereas blown or vibrated instruments contain LATs of a longer duration. LAT = log 10 (T 1 T 0), (6.1) where T 0 is the time the signal starts; and T 1 is reaches its sustained part (harmonic space) or maximum part (percussive space) AudioHarmonicityType (HRM) The motivation for using the MPEG-7 descriptor, AudioHarmonicityType is that it describes the degree of harmonicity of an audio signal.[j.m.martinez and R. Koenen, 2001]

70 56 Most percussive instruments contain a latent indefinite pitch that confuses and causes exceptions to parameters set forth in Hornbostel-Sachs. Furthermore, some percussive instruments such as a cuica or guido contain a weak LogAttackTime and therefore fall into non-percussive cluster while still maintaining an indefinite pitch (although, we can perceive differences in contents of low and high frequencies in percussive sounds as well). The use of the descriptor AudioHarmonicityType theoretically should solve this issue. It includes the weighted confidence measure, SeriesOfScalar- Type that handles portions of signal that lack clear periodicity. AudioHarmonicity combines the ratio of harmonic power to total power: HarmonicRatio, and the frequency of the inharmonic spectrum: UpperLimitOfHarmonicity. First: We make the Harmonic Ratio H(i) the maximum r(i,k) in each frame, i where a definitive periodic signal for H(i) =1 and conversely white noise = 0. H(i) =max r(i, k) (6.2) where r(i,k) is the normalised cross correlation of frame i with lag k: r(i, k) = m+n 1 j=m s(j) s(j k) /( m+n 1 j=m s(j) 2 m+n 1 j=m s(j k) 2 ) 1 2 (6.3) where s is the audio signal, m=i*n, wherei=0, M 1=frame index and M =the number of frames, n=t*sr, wheret = window size (10ms) and sr = sampling rate, k=1, K=lag, wherek=ω*sr, ω = maximum fundamental period expected (40ms) Second: Upon obtaining the i) DFTs of s(j) and comb-filtered signals c(j) in the AudioSpectrumEnvelope and ii) the power spectra p(f) andp (f) in the AudioSpectrumCentroid we take the ratio f lim and calculate the sum of power beyond the frequency for both s(j) and c(j): a(f lim )= f max f=f lim p (f) / fmax f=f lim p(f) (6.4)

71 Figure 6.2: Five levels of sustainability to severe dampening where f max is the maximum frequency of the DFT. Third: Starting where f lim = f max we move down in frequency and stop where the greatest frequency, f ulim s ratio is smaller than 0.5 and convert it to an octave scale based on 1 khz: UpperLimitOfHarmonicity = log2(f ulim /1000) (6.5) Sustainability (S) We define sustainability into 5 categories based on the degree of dampening or sustainability the instrument can maintain over a maximum period of 7 seconds. For example, a flutist, horn player and violinist can maintain a singular note for more than 7 seconds therefore they receive a 1. Conversely a plucked guitar or single drum note typically cannot sustain that one sound for more than 7 seconds. It is true that a piano with pedal could maintain a sound after ten seconds but the sustainability factor would be present Experiments The sound data consists of a sample set of 156 signals extracted from our online database at which contains 6,300 segmented sounds mostly from MUMS audio CD s that contain samples of broad range of musical instruments, including orchestral ones, piano, jazz instruments, organ, etc. [Opolko and Wapnick, 1987] These CD s are widely used in musical instrument sound research [Cosi et al., 1 98,

72 58 Martin and Kim,, Wieczorkowska, 375, Fujinaga and McMillan, 2000, Kaminskyj, 2000b, Eronen and Klapuri, 2000], so they can be considered as a standard. The database consists of 188 samples each representing just one sample from group that make up the 6,300 files in the database. Mums divides the database into the following 18 classes: violin vibrato, violin pizzicato, viola vibrato, viola pizzicato, cello vibrato, cello pizzicato, double bass vibrato, double bass vibrato, double bass pizzicato, flute, oboe, b-flat clarinet, trumpet, trumpet muted, trombone, trombone muted, French horn, French horn muted, and tuba. Preprocessing these groups is not a part of RSES, used to extract knowledge from data because rough sets require that input data process the rough sets. Rough set are objective with respect to its data. Here we discretize, using MPEG-7 classifiers as the experts. This is the point of the thesis, we show a novel, empirical methodology of dividing sounds conducive to automatic retrieval of music. Testing: c4.5 The principle objective of our testing is to prove how parameter-based classification differs, and when used on Sachs-Hornbostel - improves Sachs-Hornbostel. Our parameters are machine-based, based on MPEG-7 and the temporal signal dampening. It is not based upon humanistic intuitiveness. To induce the classification rules in the form of decision trees from a set of given examples we used Quinlan s C4.5 algorithm. [Quinlan, 1996] The algorithm constructs a decision tree to form production rules from an unpruned tree. Next a decision tree interpreter classifies items which produces the rules. We used Bratko s Orange software [Demsar et al., ] and implement C4.5 with scripting in Python. HRM, LAT, S, with HS01 The first test comprised the testing of the decision attribute Sachs-Hornbostel-level-1 against our two MPEG-7 descriptors, Harmonicity (HRM), Log Attack (LAT) and our

73 59 temporal feature Sustainability (S). The Sachs-Hornbostel-level-1 attribute consists of four classes based upon human intuitiveness: aerophones, idiophones, chordophones and membranophones. See Appendix Figure 6.3 HRM, LAT, S, with HS02 The second test comprised the testing of the decision attribute Sachs-Hornbostel-level- 2 against the HRM, LAT and S descriptors. The Sachs-Hornbostel-level-2 attribute consists of four classes: aerophones, idiophones, chordophones and membranophones. See Appendix Figure 6.4 HRM, LAT, S, with Instruments The third test comprised the testing of the decision attribute instruments against the HRM, LAT and S descriptors. The Instrument attribute consists of four classes that describe instruments in the manner machines look at their signals: percussion, blown, string and struck Harmonics. See Appendix Figure 6.5 Resulting Tree The resulting tree shows how the sound objects are grouped, and we can compare how this classification differs from Sachs-Hornbostel system. The misclassified objects show discrepancies between the Sachs-Hornbostel system, and sound properties described by physical attributes such as seen with TeX discrepancies of piano and tambourine.the novelty of this methodology is that adding the temporal feature and grouping the instruments from the machines point of view have lead to 83% correctness. We have 26 more MPEG-7 descriptors to use with this methodology to breakdown the 17% misclassified Testing: Rough Sets Using the same attributes, LAT symbolic, HR symbolic, S symbolic and we found rules with a minimum of 90% confidence with support of no less than 6 using RSES.

74 60 Table 6.1: Calculating rules for S with a minimum of 90% confidence with support of no less than 6 using RSES LAT HR S Support ,Inf AND ,Inf ,Inf AND ,Inf ,Inf AND , , AND ,Inf , AND -Inf, ,Inf AND , , AND , , AND , , AND -Inf, Inf, AND -Inf, Table 6.2: Calculating rules for Articulation with a minimum of 90% confidence with support of no less than 6 using RSES LAT HR S art Support ,Inf AND ,Inf AND1 blown , AND ,Inf AND1 blown , AND -Inf, AND 4 percussion 9 -Inf, AND -Inf, AND4 percussion 9 We discretized and generated rules by the LEM2 algorithm generated 9 rules as shown in tables 6.1 and 6.2: C4.5 generated 19 rules that included 4 rules operating under the same parameters of 90% confidence with support of no less than 6. Comparing the 4 C4.6 rules with the 4 RSES, LERS-based set rules indicates that the rules generated demonstrate a robust tree. The issue will be one of time once we begin to use the large datasets.

75 Figure 6.3: C4.5 results testing the decision attribute Sachs-Hornbostel-level-1 against our two MPEG-7 descriptors, Harmonicity (HRM), Log Attack (LAT) and our temporal feature Sustainability (S). S is divided at the <2.000 and >=2.000 node, Harmonicity is divided at < and >= for S <2.000 whereas, at >=2.000 LAT cuts the tree at LAT < and >=

76 Figure 6.4: C4.5 results testing the decision attribute Sachs-Hornbostel-level-2 against our two MPEG-7 descriptors, Harmonicity (HRM), Log Attack (LAT) and our temporal feature Sustainability (S) 62

77 Figure 6.5: C4.5 results testing of the decision attribute instruments against the HRM, LAT and S descriptors. The Class files indicate whether the instruments are percussive, blown, string or struck harmonics. 63

78 CHAPTER 7: MUSIC SIGNAL SEPARATING 7.1 Introducing a New Approach for Music based on KDD Polyphonic Pitches and Instrumentations Given that MIRAI will eventually have to have the ability to read complex music, MIRAI needs to be able to pinpoint a sound within a multitude of sounds in a song recordation. Our approach changed course when we realized the need to identify pitches and instrumentations in sounds where more than one instrument was playing called Polyphonic meaning more than one sound. See [Lewis et al., 2007b] The reason for the experiments was that we knew that pitch and timbre detection methods applicable to monophonic digital signals are common. However, successful detection of multiple pitches and timbres in polyphonic time-invariant music signals remained a challenge. We reviewed these methods, sometimes called Blind Signal Separation, and present in this Thesis for the purpose analyzing how musically trained human listeners overcome resonance, noise, and overlapping signals to identify and isolate what instruments are playing and then what pitch each instrument is playing. The part of the instrument and pitch recognition system, presented in this thesis, responsible for identifying the dominant instrument from a base signal uses temporal features proposed by Wieczorkowska [Slezak et al., ] in addition to the standard 11 MPEG7 features. After retrieving a semantical match for that dominant instrument from the database, it creates a resulting foreign set of features to form a new synthetic basen signal which no longer bears the previously extracted dominant sound. The system may repeat this process until all recognizable dominant instruments are accounted for in the segment. The proposed methodology incorporates Knowledge

79 65 Discovery, MPEG7 segmentation and Inverse Fourier Transforms. Blind Signal Separation (BSS) and Blind Audio Source Separation (BASS) have recently emerged as the subjects of intense work in the fields of Signal Analysis and Music Information Retrieval. This thesis focuses on the separation of harmonic signals of musical instruments from a polyphonic domain for purpose of music information retrieval. First, it recognizes the state of the art in the fields of signal analysis. Particularly, Independent Component Analysis and Sparse Decompositions. Next it reviews music information retrieval systems that blindly identify sound signals. Herein we first present a new approach to the separation of harmonic musical signals in a polyphonic time-invariant music domain and then secondly, the construction of new correlating signals which include the inherent remaining noise. These signals represent new objects which when included in the database, with continued growth, improve the accuracy of the classifiers used for automatic indexing Signal Analysis In 1986, Jutten and Herault proposed the concept of Blind Signal Separation 1 as a novel tool to capture clean individual signals from noisy signals containing unknown, multiple and overlapping signals [Herault and Jutten, 1986]. The Jutten and Herault model comprised a recursive neural network for finding the clean signals based on the assumption that the noisy source signals were statistically independent. Researchers in the field began to refer to this noise as the cocktail party property, as in the undefinable buzz of incoherent sounds present at a large cocktail party. By the mid 1990 s researchers in neural computation, finance, brain signal processing, general biomedical signal processing and speech enhancement, to name a few, embraced the algorithm. Two models dominate the field; Independent Component Analysis (ICA) [Cardose, 1998] and Sparse Decompositions (SD) [Pearlmutter and Kisilev, 2001]. ICA originally began as a statistical method that 1 See Appendix A.

80 expressed a set of multidimensional observations as a combination of unknown latent 66 variables [Herault and Jutten, 1986]. The principle idea behind ICA is to reconstruct these latent, sometimes called dormant, signals as hypothesized independent sequences where k = the unknown independent mixtures from the unobserved independent source signals: x = f(θ,s), (7.1) where x =(x 1,x 2,..., x m ) is an observed vector and f is a general unknown function with parameters Θ [Bingham, 2003] that operates the variables listed in the vector s =(s 1,..., s n ) s(t) =[s 1 (t),..., s k (t)] T. (7.2) Here a data vector x(t) is observed at each time point t, such that given any multivariate data, ICA can decorrelate the original noisy signal and produce a clean linear co-ordinate system using: x(t) =As(t), (7.3) where A is a n k full rank scalar matrix. For instance (Fig. 7.1), if a microphone receives input from a noisy environment containing a jet fighter, an ambulance, people talking and a speaker-phone, then x i (t) =a i1 s 1 (t)+a i2 s 2 (t)+a i3 s 3 (t)+a i4 s 4 (t). In this case we are using i = 1 : 4 ratio. Rewriting it in a vector notation, it becomes x = A s. For example, looking at a two-dimensional vector x =[x 1 x 2 ] T ICA finds the decomposition: x 1 x 2 = a 11 a 21 s 1 + a 12 a 22 s 2 (7.4) x = a 1 s 1 + a 2 s 2 (7.5) where a 1,a 2 are basis vectors and s 1,s 2 are basis coefficients. Sparse decomposition was first introduced in the field of image analysis by Field and

81 67 Figure 7.1: A noisy cocktail party Olshausen[Olshausen and Field, 1996]. Nowadays, the most general SD algorithm is probably Zibulevsky s where his resulting optimization is made on two factors based on the output vector s entropy and sparseness. Similar to ICA, in SD, the resulting signalx(t) isthesumoftheunknownn k matrixa andnoiseξ(t), where n represents the sensors and k represents the unknown scalar source signals.: x(t) =As(t)+ξ(t). (7.6) The signals are sparsely represented in a signal dictionary [Zibulevsky and Pearlmutter, 2000]: k s i (t) = C ikϕk (t), (7.7) k=1 where the ik and ϕk represent the atoms of the dictionary. In the field of Music Information Retrieval systems, algorithms that analyze polyphonic time-invariant music signals systems operate in either the time domain [et al, 1997], the frequency domain [Smaragdis, 1998] or both the time and frequency domains simultaneously [Lambert and Bell, 1997]. Kostek takes a different approach and instead divides BSS algorithms into either those operating on multichannel or single channel sources. Multichannel sources detect signals of various sensors whereas single channel sources are typically harmonic [et al, 2005]. For clarity, let it be said that experiments provided herein switch between the time and frequency domain, but more

82 68 importantly, per Kostek s approach, our experiments fall into the multichannel category because, at this point of experimentation two harmonic signals are presented for BSS. In 2000, Fujinaga and MacMillan created a real time system for recognizing orchestral instruments using an exemplar-based learning system that incorporated a k nearest neighbor classifier (k-nnc) [Fujinaga. and MacMillan., 2000] using a genetic algorithm to recognize monophonic tones in a database of 39 timbres taken from 23 instruments. Also, in 2000, Eronen and Klapuri created a musical instrument recognition system that modeled the temporal and spectral characteristics of sound signals [Eronen and Klapuri, 2000]. The classification system used thirty-two spectral and temporal features and a signal processing algorithms that measured the features of the acoustic signals. The Eronen system was a step forward in BSS because the system was pitch independent and it successfully isolated tones of musical instruments using the full pitch range of 30 orchestral instruments played with different articulations. Also, both hierarchic and direct forms of classification were evaluated using 1498 test tones obtained from the McGill University Masters Samples (MUMs) CDs including home made recordings from amateur musicians. In 2001 Zhang constructed a multi-stage system that segmented the music into it individual notes, found the harmonic partial estimation from a polyphonic source and then normalized the features for loudness, length and pitch [Zhang, 2001]. The features included the 1) temporal features accounting for rising speed, degree of sustaining, degree of vibration, and releasing speed, 2) spectral features accounting for the spectral energy distribution between low, middle and high frequency sub-bands and the partial harmonic such as brightness, inharmonicity, tristimulus, odd partial ratio, irregularity and dormant tones. Zhang s system successfully identified instruments playing in a polyphonic music pieces. In one the polyphonic source contained 12 instruments including, cello, viola, violin, guitar, flute, horn, trumpet, piano, organ, erhu, zheng, and sarod. The significance of Zhang s system was in the manner it used artificial neural networks

83 69 to find the dominant instrument: First it segmented each piece into notes and then categorized the music based on the what instrument played the most notes. It then weighted this number by the likelihood value of each note when it is classified to this instrument. For example, if all the notes in the music piece were grouped into K subsets: I 1 ; I 2 ;...I K, with I i corresponding to the ith instrument, then a score for each instrument was computed as: s Ii = x I i O i (x), i =1 k (7.8) where x denotes a note in the music piece, and O i (x) is the likelihood that x will be classified to ith instrument. Next, Zhang normalized the score to satisfy the following condition: s Ii = k s(i i ) = 1 (7.9) i=1 It is interesting to note the similarity between this and Zibulevsky s Eq.07 infra. Zhang used 287 music monophonic and polyphonic pieces and he reached an accuracy of 80 % success in identifying the dominant instrument and 90 % if intra-family confusions were able to be dismissed. Classification of the Zhang s system incorporated a Kohonen self-organizing map to select the optimal structure of each feature vector. In 2002, Wieczorkowska, collaborated with Slezak, Wróblewski and Synak [Slezak et al., ] and used MPEG-7 based features to create a testing database for training classifiers used to identify musical instrument sounds. She used seventeen MPEG-7 temporal and spectral descriptors observing the trends in evolution of the descriptors over the duration of a musical tone, their combinations and other features. Wieczorkowska compared the classification performance of the knnc and rough set classifiers using various combinations of features. Her results showed that the knnc classifier outperformed, by far, the rough set classifiers. In 2003, Eronen and Agostini both tested, in separate tests, the viability of using decision tree classifiers in

84 70 Music Information retrieval. They both found that decision tree classifiers ruined the classification results: Eronen s system recognized groups of musical instruments from isolated notes using Hidden Markov Models [Eronen and Klapuri, 2000]. Eronen classified the instruments into groups such as strings or woodwinds, not as individual instruments. Agostini s system [Longar and Pollastri, 2003] tested a monophonic base of 27 instruments using eighteen temporal and spectral features with a number of classification procedures to determine which procedure worked most effectively. The experimentation used a number of classical methods including canonical discriminant analysis, quadratic discriminant analysis and support vector machines. Agostini s Support Vector tests yielded a 70 % accuracy on individual instruments. Groups of instruments yielded 81% accuracy. As in this thesis s experiments, Agostini s classifiers were MPEG-7 based. The experiments used 18 descriptors for each tone to compute mean and standard deviation of 9 features over the length of each tone. Agostini s system used a 46 ms window for the zero-crossing rate to procure measurements directly from the waveform as the number of sign inversions. To obtain a useable number of harmonics a pitch tracking algorithm controlled each signal by first analyzing it at a low-frequency and repeating it at smaller resolutions until a sufficient number of harmonics was estimated. Interestingly, they used a variable window size to obtain a frequency resolution of at least 1/24 of octaves. The team evaluated the harmonic structure of their signals with FFT s using half-overlapping windows. In 2004, Kostek developed a 3-stage classification system that successfully identified up to twelve instruments played under a diverse range of articulations [Kostek, 729]. The manner in which Kostek designed her stages of signal preprocessing, feature extraction and classification may prove to be the standard in BSS MIR. In the preprocessing stage Kostek incorporates 1) the average magnitude difference function and 2) Schroeder s histogram for purposes of pitch detection. Her feature extraction stage extracts three distinct sets of features: Fourteen FF1 based features,

Her feature extraction stage extracts three distinct sets of features: fourteen FFT-based features, MPEG-7 standard feature parameters, and wavelet-analysis features. In the final stage, for classification, Kostek incorporates a multi-layer ANN classifier. Importantly, Kostek concluded that she obtained the strongest results when employing a combination of MPEG-7 and wavelet features; she also found that performance deteriorated as the number of instruments increased.

Our BSS Experiments

Stepping back and reviewing Kostek, Zhang and Agostini, it became apparent to the authors that BSS works diametrically opposite to the manner in which trained human listeners segment polyphonic music. When presented with a polyphonic source signal, trained humans overcome resonance, noise and the complexity of instruments playing simultaneously to identify and isolate which instruments are playing, and then also identify what pitch each instrument is playing. The BSS system presented in this thesis began with the authors thinking carefully about how humans, as opposed to classical MIR systems, identify sounds in polyphonic sources. A small, anecdotal test formed the seed for the system presented herein. In the spring of 2006, in order to get a sense of how humans listen to music, one of the authors, Lewis, took an original piece of music he had composed and performed with his band, changed it slightly, and tested the band members as follows. Lewis knew the results would be anecdotal and unscientific, but he was intrigued by what the outcome would be. Each band member was very familiar with the song and its instrumentation: they were present when Lewis composed it, they recorded it over the course of weeks in a studio, and they performed it live in front of audiences many hundreds of times. Essentially, each member knew the song intimately. Lewis made four new versions: Version 1 omitted the kick drum and cymbal from the drum tracks. Version 2 changed some bass notes and omitted others. Version 3 swapped horn sections around and changed the pitch of the horn at six points.

Finally, Version 4 extracted the guitar part and inserted three chords that had never before been played in the song. Lewis asked each member to listen to three of the versions - every version except the one in which his own instrument had been changed. For example, Version 3 contained changes to the horn section, so the horn player listened to Versions 1, 2 and 4, not to Version 3, where he would immediately have heard that his horn solos had been swapped. As the horn player listened to Versions 1, 2 and 4 he began to get bored. When asked to listen carefully for what had changed, he could not hear the missing drum tracks in Version 1, the missing and altered bass guitar in Version 2, or the altered guitar tracks in Version 4. In fact, no member of the band could hear any changes to the other instruments, even when asked specifically to listen for them - with one exception: the bass player identified one of the fourteen changes in the guitar track and asked whether it was an earlier version in which Lewis had played the guitar part differently. We concluded that trained musicians effectively block out the instruments they are not interested in. The bass player was attentive to that particular guitar section because he cued one of his solos off the timing of the missing note; at that moment he would tune into the guitar and then block it out again as he played his solo. The question became: how do musicians block out sound? How do New Yorkers block out the constant horn honking and the ambulance and police sirens so they can fall asleep, or, conversely, how do farmers block out animal sounds so they can fall asleep? The answer, for the purposes of this thesis, is that we do not know how humans block out sound - but clearly they do. Moreover, as an in-depth study of Kostek, Zhang and Agostini shows, the systems developed so far transmute the signal into the frequency domain and manipulate it there, focusing on the dominant timbres, pitches, cepstra, tristimulus values and frequencies, to name a few. The common factor in all of the above is that only the original sound source is used. In other words, none of the above approaches inserts a foreign entity into the equation, as humans probably do.

Nor do any of the above approaches train their classifiers using artificial samples of music objects produced by an MIR system. A human who has never heard a South African Zulu penny whistle cannot "not hear" it - cannot block it out - until he or she has heard it a few times. Typically, Lewis's band members, like most experienced band musicians, can hear a song, listen to the counterpart instruments playing in it, and play their own part almost immediately - except when they have not previously heard an instrument that they would normally block out. This became evident when Lewis brought back to the USA recordings of songs he had purchased in Johannesburg. The band members were unable to focus on anything, let alone their own instrument parts, because the new instrument, the Zulu penny whistle, could not be blocked out. Why? We believe the answer lies in the fact that, because the band members had never heard a Zulu penny whistle, they had no past data of Zulu penny whistle sounds with which to block it out and so enable them to focus on their counterpart in the song. Again, this led the authors to believe that humans use a set of sounds held in their heads to block out noise in a song so that they can focus on exactly the portion of the song they want to listen to. This is the seminal question that led the authors to develop the system presented in this thesis: a system that uses foreign entities to block out signals within polyphonic signals. In short, when the system reads a polyphonic source, it identifies a dominant aspect of that source, finds its match in the database, and inserts this foreign entity into the polyphonic source in order to do what humans do, i.e., block out the portion of the original sound it is not interested in. To perform the experiments, the system analyzes four separate versions of a polyphonic source (Figures 7.4 to 7.7 below) containing two continuous harmonic signals obtained from the McGill University Master Samples (MUMS) CDs. These versions mix samples one and two with various levels of noise. Specifically, the first sample contains a C at octave 5 played on a nine-foot Steinway, recorded at 44,100 Hz in 16-bit stereo (Fig. 7.2). The second sample contains an A at octave 3 played on a Bb clarinet, recorded at 44,100 Hz in 16-bit stereo (Fig. 7.3).

Figure 7.2: 5C, 44,100 Hz, 16 bit, stereo
Figure 7.3: 3A Bb, 44,100 Hz, 16 bit, stereo

The third sample contains a mix of the first and second samples with no noise added, produced with Sony's Sound Forge 8.0 as a pure mix recorded at 44,100 Hz in 16-bit stereo (Fig. 7.4). Similarly, the fourth sample contains a mix of the first and second samples with noise added at dB ( %) (Fig. 7.5). The fifth sample contains a mix of the first and second samples with noise added at dB (-1.58 %) (Fig. 7.6). Finally, the sixth sample contains a mix of the first and second samples with noise added at -8.5 dB ( %) (Fig. 7.7).
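As a hedged illustration of how such test mixes could be constructed programmatically (the thesis mixes were made in Sony Sound Forge 8.0, not with this code), the sketch below sums two equal-length signals and adds white noise at a chosen level relative to the mix. The -8.5 dB figure is taken from the sixth sample; the noise type and the RMS-relative gain convention are assumptions.

```python
import numpy as np

def mix_with_noise(sig_a, sig_b, noise_db=None, seed=0):
    """Mix two equal-length signals and optionally add white noise.

    noise_db: noise level in dB relative to the RMS of the clean mix
    (e.g. -8.5, as for the sixth test sample); None means a pure mix.
    White noise and RMS-relative scaling are assumptions, not a
    description of how the thesis samples were actually produced.
    """
    mix = np.asarray(sig_a, dtype=float) + np.asarray(sig_b, dtype=float)
    if noise_db is not None:
        rng = np.random.default_rng(seed)
        noise = rng.standard_normal(len(mix))
        target_rms = np.sqrt(np.mean(mix ** 2)) * (10.0 ** (noise_db / 20.0))
        noise *= target_rms / (np.sqrt(np.mean(noise ** 2)) + 1e-12)
        mix = mix + noise
    return mix

if __name__ == "__main__":
    sr = 44100
    t = np.arange(sr) / sr                          # one second of audio
    piano_like = np.sin(2 * np.pi * 523.25 * t)     # stand-in for the piano 5C
    clarinet_like = np.sin(2 * np.pi * 220.0 * t)   # stand-in for the clarinet 3A
    noisy_mix = mix_with_noise(piano_like, clarinet_like, noise_db=-8.5)
    print(noisy_mix.shape)
```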

Figure 7.4: Piano and clarinet mix, 44,100 Hz, 16 bit, stereo - no noise
Figure 7.5: Piano and clarinet mix, 44,100 Hz, 16 bit, stereo - noise 01 at dB ( %)
Figure 7.6: Piano and clarinet mix, 44,100 Hz, 16 bit, stereo - noise 02 at dB (-1.58 %)

Figure 7.7: Piano and clarinet mix, 44,100 Hz, 16 bit, stereo - noise 03 at -8.5 dB ( %)

7.2 Initial Blind Signal Experiments

In explaining the system procedures, reference will be made to the two foreign samples housed in the database (Fig. 7.2 and Fig. 7.3), containing the piano 5C and the clarinet 3A. The polyphonic input to the system consists of the four variations of the piano 5C and clarinet 3A mix. For the purpose of this discussion it is also assumed that the clarinet 3A is the dominant feature of all four variations of the mix. The system reads the input and uses an FFT to transform it into the frequency domain. In the frequency domain it determines that the fundamental frequency of 3A, with a woodwind-like timbre, is dominant (Fig. 7.8). The system then searches the database: first it extracts all 3A pitches of each instrument, and next it separates all woodwind-like sounds within this temporary 3A cache. At this point it uses the MPEG-7 descriptor-based classifier to find the 3A clarinet closest to the one identified. It extracts the wave of that 3A clarinet and performs an FFT on this foreign sound entity. It then subtracts the foreign entity's FFT from the input entity's FFT, leaving a spectrum that, when subjected to an IFFT, produces a wave containing only the piano 5C together with resonance, harmonics and other negligible noise.
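The following is a minimal sketch of the subtraction step just described (and diagrammed in Fig. 7.8): the FFT of a matched foreign sample is removed from the FFT of the polyphonic input, and the remainder is inverted back to a waveform. It is an illustrative simplification that assumes the foreign sample is already time-aligned, gain-matched and of the same length as the input, which the real system does not guarantee.

```python
import numpy as np

def subtract_foreign_entity(mix, foreign):
    """Remove a matched foreign sample from a polyphonic mix in the
    frequency domain, in the spirit of the Fig. 7.8 procedure (simplified).

    Both signals are assumed to be the same length and already aligned;
    real recordings would also need gain matching and windowing.
    """
    mix = np.asarray(mix, dtype=float)
    foreign = np.asarray(foreign, dtype=float)

    # FFT of the polyphonic input and of the foreign database sample.
    mix_spec = np.fft.rfft(mix)
    foreign_spec = np.fft.rfft(foreign)

    # Subtract the foreign entity's spectrum from the input spectrum.
    residual_spec = mix_spec - foreign_spec

    # IFFT back to a waveform: ideally what remains is the other source
    # plus resonance, harmonics and other negligible noise.
    return np.fft.irfft(residual_spec, n=len(mix))

if __name__ == "__main__":
    sr = 44100
    t = np.arange(sr) / sr
    piano_5c = np.sin(2 * np.pi * 523.25 * t)
    clarinet_3a = np.sin(2 * np.pi * 220.0 * t)
    mix = piano_5c + clarinet_3a
    residual = subtract_foreign_entity(mix, clarinet_3a)
    print(np.max(np.abs(residual - piano_5c)))   # ~0 in this idealized case
```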

Figure 7.8: Theoretical procedure: subtracting a foreign, extracted signal's FFT from the source FFT

In considering the use of MPEG-7, we recognized that a sound segment containing musical instruments may have three states: transient, quasi-steady and decay. Identifying the boundary of the transient state enables accurate timbre recognition. Wieczorkowska presented a timbre detection system in [Wieczorkowska et al., 2003a] in which she split each sound segment into seven equal intervals. Because different instruments require different lengths, Ras and Zhang used a new approach that looks at the time it takes for the transient to reach the quasi-steady state of the fundamental frequency [Ras and Zhang, 2006]; it is estimated by computing the local cross-correlation function of the sound object and the mean time to reach the maximum within each frame. The system developed herein is based on the following MPEG-7 descriptors. The AudioSpectrumCentroid is a description of the center of gravity of the log-frequency power spectrum. The spectrum centroid is an economical description of the shape of the power spectrum: it indicates whether the power spectrum is dominated by low or high frequencies and, additionally, it is correlated with a major perceptual dimension of timbre, i.e., sharpness. To extract the spectrum centroid: (1) calculate the power spectrum coefficients; (2) replace the power spectrum coefficients below 62.5 Hz by a single coefficient whose power equals their sum, placed at a nominal low frequency; (3) scale the frequencies of all coefficients to an octave scale anchored at 1 kHz. The AudioSpectrumSpread is a description of the spread of the log-frequency power spectrum. The spectrum spread is an economical descriptor of the shape of the power spectrum that indicates whether the power is concentrated near the spectrum centroid or spread out over the spectrum, which helps distinguish noise-like from tonal sounds.
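A minimal sketch of the centroid computation described above is given below. It follows the three listed steps, with an assumed 31.25 Hz placement for the collapsed low band (the nominal frequency is not specified in the text above); it is an illustration, not the MPEG-7 reference implementation.

```python
import numpy as np

def audio_spectrum_centroid(frame, sample_rate, low_edge=62.5, low_nominal=31.25):
    """Log-frequency spectrum centroid following the three steps above.

    low_nominal (31.25 Hz) is an assumed placement of the collapsed low
    band; this is a sketch, not the MPEG-7 reference code.
    """
    frame = np.asarray(frame, dtype=float)

    # Step 1: power spectrum coefficients.
    spectrum = np.fft.rfft(frame)
    power = np.abs(spectrum) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)

    # Step 2: collapse all coefficients below 62.5 Hz into one coefficient
    # whose power is their sum, placed at the nominal low frequency.
    low = freqs < low_edge
    power = np.concatenate(([power[low].sum()], power[~low]))
    freqs = np.concatenate(([low_nominal], freqs[~low]))

    # Step 3: express frequencies on an octave scale anchored at 1 kHz,
    # then take the power-weighted mean of that scale.
    octaves = np.log2(freqs / 1000.0)
    return float(np.sum(octaves * power) / (np.sum(power) + 1e-12))

if __name__ == "__main__":
    sr = 44100
    t = np.arange(4096) / sr
    tone = np.sin(2 * np.pi * 1000.0 * t)        # 1 kHz tone -> centroid near 0
    print(round(audio_spectrum_centroid(tone, sr), 3))
```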
