Segmentation of musical items: A Computational Perspective

Segmentation of musical items: A Computational Perspective

A THESIS

submitted by

SRIDHARAN SANKARAN

for the award of the degree of

MASTER OF SCIENCE (by Research)

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY, MADRAS

October 2017

THESIS CERTIFICATE

This is to certify that the thesis entitled Segmentation of musical items: A Computational Perspective, submitted by Sridharan Sankaran, to the Indian Institute of Technology, Madras, for the award of the degree of Master of Science (by Research), is a bonafide record of the research work carried out by him under my supervision. The contents of this thesis, in full or in parts, have not been submitted to any other Institute or University for the award of any degree or diploma.

Dr. Hema A. Murthy
Research Guide
Professor
Dept. of Computer Science and Engineering
IIT Madras, 600 036

Place: Chennai
Date:

ACKNOWLEDGEMENTS

I would like to express my sincere gratitude to my guide, Prof. Hema A. Murthy, for her excellent guidance, patience and for providing me with an excellent atmosphere for doing research. She helped me to develop my background in signal processing and machine learning and to experience the practical issues beyond the textbooks. The endless sessions that we had about research, music and beyond have not only helped in improving my perspective towards research but also towards life.

I would like to thank my collaborator Krishnaraj Sekhar PV. The completion of this thesis would not have been possible without his contribution. He helped me in building datasets, carrying out the experiments, analyzing results and in writing research papers.

Thanks to Venkat Viraraghavan, Jom and Krishnaraj for proof reading this thesis.

I am grateful to the members of my General Test Committee for their suggestions and criticisms with respect to the presentation of my work. I am also grateful for being a part of the CompMusic project. It was a great learning experience working with the members of this consortium.

I would like to thank Dr Muralidharan Somasundaram, my guide at Tata Consultancy Services, for making maths look simple. I am grateful to Prof. V. Kamakoti, who encouraged me to pursue this programme at IIT and connected me to my guide.

I would like to thank my employer Tata Consultancy Services for sponsoring me for this external programme and accommodating my absence from work whenever I was at the institute.

I would like to thank Anusha, Jom, Karthik, Manish, Padma, Praveen, Raghav, Sarala, Saranya, Shrey, and other members of Donlab for their help and support over the years. It would have been a lonely lab without them.

I am also obliged to the European Research Council for funding the research under the European Union's Seventh Framework Programme, as part of the CompMusic [14] project (ERC grant agreement ).

I would like to thank my family for their support and for tolerating my non-cooperation at home citing my academic pursuits.

ABSTRACT

KEYWORDS: Carnatic Music, Pattern matching, Segmentation, Query, Cent filter bank

Carnatic music is a classical music tradition widely performed in the southern part of India. While Western classical music is primarily polyphonic, meaning different notes are sounded at the same time to create harmony, Carnatic music is essentially monophonic, meaning only a single note is sounded at a time. Carnatic music focuses on expanding those notes and expounding the melodic and emotional aspects. Carnatic music also gives importance to (on-the-stage) manodharma (improvisations).

Carnatic music, which is one of the two styles of Indian classical music, has rich repertoires with many traditions and genres. It is primarily an oral tradition with minimal codified notation, and hence it has well established teaching and learning practices. Carnatic music has hardly been archived with the objective of music information retrieval (MIR), nor has it been studied scientifically until recently. Since Carnatic music is rich in manodharma, it is difficult to analyse and represent using techniques adopted for Western music.

With MIR, there are many aspects that can be analysed and retrieved from a Carnatic music item, such as the rāga, the tāla, the various segments of the item, the rhythmic strokes used by the percussion instruments and the rhythmic patterns used. Any such MIR task will be of great benefit not only to enhance the listening pleasure but also to serve as a learning aid.

6 aid for students. In Carnatic music, musical items are made up of multiple segments. The main segment is the composition (kriti) which has melody, rhythm and lyrics and it can be optionally preceded by pure melody segment (ālāpanā) without lyrics or beats (tālam). The ālāpanā segment, if present, will have a sub-segment rendered by vocalist optionally followed by a sub-segment rendered by the accompanying violinist. The kriti in turn is generally made of three sub-segments - pallavi, anupallavi and caranam. The goal of this thesis is to segment a musical item into its various constituent segments and sub-segments mentioned above. We first attempted to segment the musical item into ālāpanā and kriti using an information theoretic approach. Here, the symmetric KL divergence (KL2) distance measure between ālāpanā segment and kriti segment was used to identify the boundary between ālāpanā and kriti segments. We got around 88% accuracy in segmenting between ālāpanā and kriti. Next we attempted to segment the kriti into pallavi, anupallavi and caranami using pallavi (or part of it) as the query template. A sliding window approach with time-frequency template of the pallavi that slides across the entire composition was used and the peaks of correlation were identified as matching pallavi repetitions. Using these pallavi repetitions as the delimiter, we were able to segment the kriti with 66% accuracy. In all these approaches, it was observed that Cent filterbank based features provided better results than traditional MFCC based approach. iv

TABLE OF CONTENTS

ACKNOWLEDGEMENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
ABBREVIATIONS

1 Introduction
  Overview of the thesis
  Music Information Retrieval
  Carnatic Music - An overview: Rāga; Tāla; Sāhitya
  Carnatic Music vs Western Music: Harmony and Melody; Composed vs improvised; Semitones, microtones and ornamentations; Notes - absolute vs relative frequencies
  Carnatic Music - The concert setting
  Carnatic Music segments: Composition; Ālāpanā
  Contribution of the thesis
  Organisation of the thesis

2 Literature Survey
  Introduction
  Segmentation Techniques: Machine Learning based approaches; Non machine learning approaches
  Audio Features: Temporal Features; Spectral Features; Cepstral Features; Distance based Features
  Discussions: CFB Energy Feature; CFB Slope Feature; CFCC Feature

3 Identification of ālāpanā and kriti segments
  Introduction
  Segmentation Approach: Boundary Detection; Boundary verification using GMM; Label smoothing using Domain Knowledge
  Experimental Results: Dataset Used; Results
  Discussions

4 Segmentation of a kriti
  Introduction
  Segmentation Approach: Overview; Time Frequency Templates
  Experimental Results: Finding Match with a Given Query; Automatic Query Detection; Domain knowledge based improvements; Repetition detection in an RTP
  Discussions

5 Conclusion
  Summary of work done
  Criticism of the work
  Future work

LIST OF TABLES

1.1 Differences in frequencies of the 12 notes for Indian Music and Western Music
Division of dataset
Confusion matrix: Frame-level labelling
Performance: Frame-level labelling
Confusion matrix: Item Classification
Performance: Item Classification
Confusion matrix: Frame-level labelling
Performance: Frame-level labelling
Confusion matrix: Item Classification
Performance: Item Classification
Comparison between various features
Manual vs automatic query extraction (CFB Energy: Cent filter bank cepstrum, CFB Slope: Cent filterbank energy slope). Time is given in seconds

LIST OF FIGURES

1.1 A typical Carnatic music concert
Tonic normalisation of two similar phrases
Typical melodic variations in repetitions
Pitch histogram of rāga Sankarābharanam with its Hindustani and Western classical equivalents
Effect of gamakas on pitch trajectory
Concert Segmentation
Item segmentation
Block diagram of MFCC extraction
Block diagram of HCC analysis
Filter-banks and filter-bank energies of a melody segment in the mel scale and the cent scale with different tonic values
KL2 values and possible segment boundaries
GMM labels
Entire song label generated using GMM
Entire song label generated using GMM after smoothing
Time-frequency template of music segments using FFT spectrum (X axis: Time in frames, Y axis: Frequency in Hz)
Time-frequency template of music segments using cent filterbank energies (X axis: Time in frames, Y axis: Filter)
Time-frequency template of music segments using cent filterbank slope (X axis: Time in frames, Y axis: Filter)
Correlation as a function of time (cent filterbank energies)
Correlation as a function of time (cent filterbank slope)
Spectrogram of query and matching segments as found by the algorithm
Query matching with cent filterbank slope feature
4.8 Query matching with Chroma feature (no overlap)
Query matching with Chroma feature (with overlap)
Query matching with MFCC feature
Intermediate output (I) of the automatic query detection algorithm using slope feature
Intermediate output (II) of the automatic query detection algorithm using slope feature
Final output of the automatic query detection algorithm using slope feature
Correlation for full query vs. half query
False positive elimination using rhythmic cycle information
Repeating pattern recognition in other genres
Normal tempo
Half the original tempo
Tisram tempo
Double the original tempo

ABBREVIATIONS

AAS            Automatic Audio Segmentation
BFCC           Bark Filterbank Cepstral Coefficients
BIC            Bayesian Information Criterion
CFCC           Cent Filterbank Cepstral Coefficients
CNN            Convolutional Neural Networks
DCT            Discrete Cosine Transform
DFT            Discrete Fourier Transform
EM             Expectation Maximisation
FFT            Fast Fourier Transform
GLR            Generalised Likelihood Ratio
GMM            Gaussian Mixture Model
HMM            Hidden Markov Model
IDFT           Inverse Discrete Fourier Transform
KL Divergence  Kullback-Leibler divergence
LP             Linear Prediction
LPCC           Linear Prediction Cepstral Coefficients
MFCC           Mel Filterbank Cepstral Coefficients
MIR            Music Information Retrieval
PSD            Power Spectral Density
RMS            Root Mean Square
STE            Short Term Energy
SVM            Support Vector Machine
t-f            Time Frequency
ZCR            Zero Crossing Rate

CHAPTER 1

Introduction

1.1 Overview of the thesis

Carnatic music is a classical music tradition performed largely in the southern states of India, namely Tamil Nadu, Kerala, Karnataka, Telangana and Andhra Pradesh. Carnatic music and Hindustani music form the two sub-genres of Indian classical music, the latter being more popular in the northern states of India. Though the origin of Carnatic and Hindustani music can be traced back to the theory of music written by Bharata Muni around 400 BCE, these two sub-genres have evolved differently over a period of time due to the prevailing socio-political environments in various parts of India, while still retaining certain core principles in common. In this work, we will focus mainly on Carnatic music, though some of the challenges and approaches to MIR described under Carnatic music are applicable to Hindustani music as well.

Rāga (melodic modes), tāla (repeating rhythmic cycle) and sāhitya (lyrics) form the three pillars on which Carnatic music rests. The concept of rāga is central to Carnatic music. While it can be grossly approximated to a scale in Western music, in reality a rāga encompasses the collective expression of melodic phrases that are formed due to movement or trajectories of notes that conform to the grammar of that rāga. The trajectories themselves are defined variously by gamakas, which are movement, inflexion and ornamentation

of notes [4, Chapter 5]. While a note corresponds to a pitch position in Western music, a note in Carnatic music (called the svara) need not be a singular pitch position but a pitch contour or a pitch trajectory as defined by the grammar of that rāga. In other words, a note in Western music corresponds to a point value in a time-frequency plane, while a svara in Carnatic music can correspond to a curve in the time-frequency plane. The shape of this curve can vary from svara to svara and rāga to rāga. The set of svaras that define the rāga is dependent on the tonic. Unlike in Western music, the main performer of a concert is at liberty to choose any frequency as the tonic of that concert. Once a tonic is chosen, the pitch positions of the other notes are derived from it. In a Carnatic music concert, this tonic is maintained by an instrument called the tambura.

The next important concept in Carnatic music is tāla. It is related to rhythm, speed and metre, and it is a measure of time. There are various types of tāla that are characterised by different mātra (beat) counts per cycle. The mātra is further subdivided into aksharas. The count of aksharas per mātra is decided by the nadai/gati of that tāla. For every composition, the main artist chooses a speed at which to render the item. Once the speed is chosen, Carnatic music is reasonably strict about keeping the speed constant, but for inadvertent minor variations due to human error.

The third important concept in Carnatic music is sāhitya, or lyrics. Most of the lyrical compositions that are performed today were written a few centuries ago. The composers were both musicians and poets, and hence it can be seen that music and lyrics go together in their compositions.

A Carnatic music concert is performed by a small ensemble of musicians, as shown in Figure 1.1.

Figure 1.1: A typical Carnatic music concert

The main artiste is usually a vocalist, but an artiste playing flute / veena / violin can also be a main artiste. Where the main artiste is a vocalist or flautist, the melodic accompaniment is given by a violin artiste. The percussion accompaniment is always given by a mrudangam artiste. An additional percussion accompaniment in the form of ghatam / khanjira / morsing is optional. A tambura player maintains the tonic.

A typical Carnatic music concert varies in duration from 90 minutes to 3 hours and is made up of a succession of musical items. These items are standard lyrical compositions (kritis) with melodies in most cases set to a specific rāga and a rhythm structure set to a specific tāla. The kritis can be optionally preceded by an ālāpanā, which is the elaboration of the rāga. The main musician chooses the set of musical items that forms the concert. The choice of items is an important part of concert planning. While there are no hard and fast rules governing the choice of items, certain traditions have been in vogue for the past 70 years. There are various musical forms in Carnatic music, namely

varnam, kriti, jāvali, thillānā, viruttam, tiruppugazh, padam, rāga mālikā and rāgam tānam pallavi (RTP). Typically, a concert will start with a varnam, followed by a set of kritis. One or two kritis will be taken up for detailed exploration and rendering. In certain concerts, an RTP is taken up for detailed rendition. Towards the end of the concert, items such as jāvali, thillānā and viruttam are rendered. There are several hundred varnams, several thousand kritis and a few hundred other forms available to choose from. Musicians choose a set of items for a concert based on many parameters, such as:

The occasion of the concert - e.g. thematic concerts based on a certain rāga or a certain composer.

Voice condition of the artiste - musical compositions that cover a limited octave range may be chosen in such cases. Also, fast tempo items may be omitted.

Contrast - contrast in rāgas, variety in composers, rotation of various types of tāla, compositions covering lyrics of various languages, variation in tempo etc.

While planning the items of a concert provides the outline of the concert, the worth of the musician is evident only in the creativity and spontaneity demonstrated by the artiste while presenting these chosen items. The creativity of the artiste gets exhibited through manodharma or improvised music, which is made up of:

1. ālāpanā (melodic exploration of a rāga without rhythm and lyrical composition)
2. niraval (spontaneous repetitions of a line of a lyric in melodically different ways, conforming to the rāga and the rhythm cycle)
3. kalpana svara (spontaneous svara passages that conform to the rāga grammar, with a variety of rhythmic structures)

Manodharma in Carnatic music is akin to an impromptu speech. The speaker himself/herself will not know what the next sentence is going to be. The quality of manodharma depends on a few factors such as:

1. Technical capability of the artiste - A highly accomplished artiste will have the confidence to take higher risks during improvisations and will be able to create attractive melodic and rhythmic patterns on the fly.

2. Technical capability of the co-artistes - Since improvisations are not rehearsed, the accompanying artistes have to be on high alert, closely following the moves of the main artiste, and be ready to do their part of the improvisations when the main artiste asks them to.

3. The mental and physical condition of the artiste - The artiste may decide not to exert himself/herself too much while traversing higher octaves or faster rhythmic niravals.

4. Audience response and their pulse - If the audience consists largely of people who do not have deep knowledge, it is prudent not to demonstrate too much technical virtuosity.

As we can see, unlike Western classical music, a Carnatic music rendition clearly has two components - the taught / practised / rehearsed part (called kalpita) and the spontaneous on-the-stage improvisations (manodharma). An MIR system for Carnatic music should be able to identify / analyse / retrieve information pertaining not only to the kalpita part but also to the manodharma part. The unpredictability of the manodharma aspect makes MIR techniques used in Western classical music ineffective for Carnatic music.

At this stage, it is reiterated that the notes that make up the melody in Carnatic music are defined with respect to a reference called the tonic frequency. Hence the analysis of a concert depends on the tonic. This makes MIR of Carnatic music non-trivial. A melody, when heard without a reference tonic, can be perceived as a different rāga depending on the svara that is assumed as the tonic. Two melodic motifs rendered with two different tonics will not show similarity unless they are tonic normalised. Figure 1.2 shows the pitch contours of two similar melodic phrases rendered at different tonics (image courtesy Shrey Dutta). Any time series algorithm will give a high distance between the two phrases without tonic normalisation. The effect of tonic

normalisation is also shown in this figure. Hence, normalisation of these phrases with respect to the tonic is important before any comparison is made.

Figure 1.2: Tonic normalisation of two similar phrases

While there are many aspects of Carnatic music that are candidates for analysis and retrieval, in this thesis we discuss the various segmentation techniques that can be applied for segmenting a Carnatic music item.
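To make the tonic normalisation step concrete, here is a minimal sketch (an illustration, not the thesis implementation) of converting pitch contours in Hz to cents relative to each performance's tonic, after which two renditions of the same phrase at different tonics become directly comparable; the phrase pattern and tonic values below are made up for illustration.

```python
import numpy as np

def hz_to_cents(pitch_hz, tonic_hz):
    """Convert a pitch contour in Hz to cents relative to the tonic.
    Unvoiced frames (pitch <= 0) are returned as NaN."""
    pitch_hz = np.asarray(pitch_hz, dtype=float)
    cents = np.full_like(pitch_hz, np.nan)
    voiced = pitch_hz > 0
    cents[voiced] = 1200.0 * np.log2(pitch_hz[voiced] / tonic_hz)
    return cents

# The same phrase shape rendered at two different tonics (illustrative values).
pattern_cents = np.array([0.0, 200.0, 400.0, 200.0, 0.0])
tonic_a, tonic_b = 146.8, 196.0
phrase_a = tonic_a * 2.0 ** (pattern_cents / 1200.0)
phrase_b = tonic_b * 2.0 ** (pattern_cents / 1200.0)

# Raw Hz contours are far apart; the tonic-normalised contours coincide.
print(np.abs(phrase_a - phrase_b).max())                      # large
print(np.abs(hz_to_cents(phrase_a, tonic_a)
             - hz_to_cents(phrase_b, tonic_b)).max())         # ~0
```

The cent filterbank features used later in this thesis rely on the same tonic-relative (cent) scale.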

In Carnatic music, a song or composition typically comprises three segments - pallavi, anupallavi and caranam - although in some cases there can be more segments. Segmentation of compositions is important from both the lyrical and musical aspects, as detailed in Chapter 4. The three segments pallavi, anupallavi and caranam have different significance. From the music perspective, one segment builds on the other. Also, one of the segments is usually taken up for elaboration in the form of neraval and kalpana svaram. From the MIR perspective, it is of importance to know which segment has been taken up by an artiste for elaboration.

The ālāpanā is another creative exercise, and the duration of an ālāpanā directly reflects a musician's depth of knowledge and creativity. So it is informative to know the duration of the ālāpanā performed by the main artiste and by the accompanying artiste. In this context, segmenting an item into ālāpanā and musical composition is of interest.

Segmentation of compositions directly from audio is a well researched problem reported in the literature. Segmentation of a composition into its structural components, using repeating musical structures (such as the chorus) as a cue, has several applications. The segments can be used to index the audio for music summarisation and for browsing the audio (especially when an item is very long). While these techniques have been attempted for Western music, where the repetitions have more or less static time-frequency melodic content, finding repetitions in improvisational music such as Carnatic music is a difficult task. This is because the repetitions vary melodically from one repetition to another, as illustrated in Figure 1.3. Here the pitch contours of four different repetitions (out of eight rendered) of the opening line of the composition vātāpi are shown. As we can see, unlike in Western music where segmentation using melodically

invariant chorus is straightforward, segmentation using melodically varying repetitions of the pallavi is a non-trivial task.

Figure 1.3: Typical melodic variations in repetitions

In this thesis, we discuss segmenting a composition into its constituent parts using the pallavi (or a part of the pallavi) of the composition as the query template. As detailed in Chapter 4, the segments of a composition have a lot of melodic and lyrical significance. Hence segmentation of a composition is a very important MIR task in Carnatic music. The repeated line of a pallavi is seen as a trajectory in the time-frequency plane. A sliding window approach is used to determine the locations of the query in the composition. The locations at which the correlation is maximum correspond to matches with the query.
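The sliding-window matching can be sketched roughly as follows; this is a simplified illustration under assumed feature shapes (bins x frames matrices), not the thesis code, and the threshold and minimum-gap parameters are hypothetical.

```python
import numpy as np

def sliding_correlation(features, query):
    """Normalised correlation of a query t-f template (bins x q_frames)
    with every position of a longer feature matrix (bins x n_frames)."""
    n_bins, n_frames = features.shape
    _, q_frames = query.shape
    q = (query - query.mean()) / (query.std() + 1e-9)
    scores = np.zeros(n_frames - q_frames + 1)
    for t in range(len(scores)):
        win = features[:, t:t + q_frames]
        w = (win - win.mean()) / (win.std() + 1e-9)
        scores[t] = np.mean(w * q)          # correlation score at this position
    return scores

def pick_peaks(scores, threshold, min_gap):
    """Positions where the correlation exceeds a threshold, keeping one
    peak per min_gap frames (candidate pallavi repetitions)."""
    peaks = []
    for t in np.argsort(scores)[::-1]:
        if scores[t] < threshold:
            break
        if all(abs(t - p) >= min_gap for p in peaks):
            peaks.append(int(t))
    return sorted(peaks)
```

Normalising each window to zero mean and unit variance makes the score behave like a correlation coefficient, which is what allows peaks to be compared across the whole composition.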

We further distinguish the composition segment from the ālāpanā segment using the difference in timbre due to the absence of percussion in the ālāpanā segment. This is done by evaluating the KL2 distance (an information theoretic distance measure between two probability density functions) between adjacent samples and thereby locating the boundary of change. For all these types of segmentation, we use cent filterbank cepstral coefficients (CFCC) as features, which are tonic independent and hence comparable across musicians and concerts where variation in tonic is possible.

1.2 Music Information retrieval

There is an ever increasing availability of music in digital format, which requires the development of tools for music search, access, filtering, classification, visualisation and retrieval. Music Information Retrieval (MIR) covers many of these aspects. Technology for music recording, digitization and playback allows users an access that is almost comparable to listening to a live performance. Two main approaches to MIR are common: 1) metadata-based and 2) content-based. In the former, the issue is mainly to find useful categories for describing music. These categories are expressed in text. Hence, text-based retrieval methods can be used to search those descriptions. The more challenging approach in MIR is the one that deals with the actual musical content, e.g. melody and rhythm. In information retrieval, the objective is to find documents that match a user's information need, as expressed in a query. In content-based MIR, this aim is usually described as finding music that is similar to a set of features or an example (query string). There are many different types of musical similarity, such as:

Musical works that bring out the same emotion (e.g. romance, sadness)

Musical works that belong to the same genre (e.g. classical, jazz)

Musical works created by the same composer

Music originating from the same culture (e.g. Western, Indian)

Varied repetitions of a melody

In order to perform analyses of various kinds on musical data, it is sometimes desirable to divide it up into coherent segments. These segmentations can help in identifying the high-level musical structure of a composition or a concert and can help in better MIR. Segmentation also helps in creating thumbnails of tracks that are representative of a composition, thereby enabling surfers to sample parts of a composition before they decide to listen / buy. The identification of musically relevant segments in music requires the usage of a large amount of contextual information to assess what distinguishes different segments from each other. In this work, we focus on segmentation as a tool for MIR in the context of Carnatic music items.

1.3 Carnatic Music - An overview

The three basic elements of Carnatic music are rāga (melody), tāla (rhythm) and sāhitya (lyrics).

1.3.1 Rāga

Each rāga consists of a series of svaras, which bear a definite relationship to the tonic note (equivalent of the key in Western music) and occur in a particular sequence

in ascending scale and descending scale. The rāgas form the basis of all melody in Indian music. The character of each rāga is established by the order and the sequence of notes used in the ascending and descending scales and by the manner in which the notes are ornamented. These ornamentations, called gamakas, are subtle, and they are an integral part of the melodic structure. In this respect, a rāga is neither a scale nor a mode. In a concert, rāgas can be sung by themselves without any lyrics (called ālāpanā) and then be followed by a lyrical composition set to tune in that particular rāga. There are a finite number (72 to be exact) of janaka (parent) rāgas and theoretically infinite possible janya (child) rāgas born out of these 72 parent rāgas. Rāgas are said to evoke moods such as tranquillity, devotion, anger, loneliness, pathos etc. [42, Page 8]. Rāgas are also associated with certain times of the day, though this is not strictly adhered to in Carnatic music.

1.3.2 Tāla

Tāla, or the time measure, is another principal element in Carnatic music. Tāla is the rhythmical grouping of beats in repeating cycles that regulates musical compositions and provides a basis for rhythmic coordination between the main artiste and the accompanying artistes. Hence, it is the theory of time measure. The beats (called the mātras) are further divided into aksharas. Tāla encompasses both the structure and the tempo of a rhythmic cycle. Almost all musical compositions other than those sung as pure rāgas (ālāpanā) are set to a tāla. There are 108 tālas in theory [41, Page 17], out of which fewer than 10 are commonly in practice. Ādi tāla (8 beats/cycle) is the one most commonly used and is also universal. The laya is the tempo, which keeps the uniformity of the time span. In a Carnatic music concert, the tāla is shown with a standardized combination of claps and finger counts by the musician.
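As a small illustrative aside (not taken from the thesis), a tāla can be represented computationally by its mātra count per cycle and the aksharas per mātra fixed by the nadai, from which the akshara count and the cycle duration at a chosen tempo follow:

```python
from dataclasses import dataclass

@dataclass
class Tala:
    name: str
    matras_per_cycle: int    # beats (matras) per cycle
    aksharas_per_matra: int  # fixed by the nadai/gati

    def aksharas_per_cycle(self) -> int:
        return self.matras_per_cycle * self.aksharas_per_matra

    def cycle_duration(self, matra_period_sec: float) -> float:
        """Duration of one complete rhythmic cycle at a given beat period."""
        return self.matras_per_cycle * matra_period_sec

# Adi tala: 8 beats per cycle; catusra nadai places 4 aksharas in each beat.
adi = Tala("Adi", matras_per_cycle=8, aksharas_per_matra=4)
print(adi.aksharas_per_cycle())   # 32
print(adi.cycle_duration(0.75))   # 6.0 seconds per cycle at 80 beats per minute
```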

26 1.3.3 Sāhitya The third important element of Carnatic music is the sāhitya (lyrics). A musical composition presents a concrete picture of not only the rāga but the emotions envisaged by the composer as well. If the composer also happens to be a good poet, the lyrics are enhanced by the music, while preserving the metre in the lyrics and the music, leading to an aesthetic experience, where a listener not only enjoys the music but also the lyrics. The claim of a musical composition to permanence lies primarily in its musical setting. In compositions considered to be of high quality, the syllables of the sāhitya blends beautifully with the musical setting. Sāhitya serve as the models for the structure of a rāga. In rare rāgas such as kiranāvali even solitary works of great composer have brought out the nerve-centre of the rāga. The aesthetics of listening to the sound of these words is an integral part of the Carnatic experience, as the sound of the words blends seamlessly with the sound of the music. Understanding the actual meanings of the words seems quite independent of this musical dimension, almost secondary or even peripheral to the ear that seeks out the music. The words provide a solid yet artistic grounding and structure to the melody. 1.4 Carnatic Music vs Western Music While one may be tempted to approach MIR in Carnatic music similar to MIR in Western music, such attempts are quite likely to fail. There are some fundamental differences between Western and Indian classical music systems, which are important to understand as most of the available techniques on repetition detection and segmentation for Western music are ineffective for Carnatic music. The differences 12

between these two systems of music are outlined below.

1.4.1 Harmony and Melody

This is the prime difference between the two classical music systems. Western classical music is primarily polyphonic, i.e. different notes are sounded at the same time. The concept of Western music rests on the harmony created by the different notes. Thus, we see different instruments sounding different notes at the same time, creating a different feel. It is the principle of harmony. The Indian music system is essentially monophonic, meaning only a single note is sung or played at a time [13, Chapter 1.3]. Its focus is on melodies created using a sequence of notes. Indian music focuses on expanding those svaras and expounding the melodic and emotional aspects.

1.4.2 Composed vs improvised

Western music is composed whereas Indian classical music is improvised. All Western compositions are formally written using staff notation, and performers have virtually no latitude for improvisation. The converse is the case with Indian music, where compositions have been passed on from teacher to student over generations, with improvisations in creative segments such as ālāpanā, niraval and kalpana svaras happening on the spot, on the stage.

1.4.3 Semitones, microtones and ornamentations

Western music is largely restricted to 12 semitones, whereas Indian classical music makes extensive use of 22 microtones (called 22 shrutis, though only 12 semitones

are represented formally). In addition to microtones, Indian classical music makes liberal use of inflexions and oscillations of notes. In Carnatic music, they are called gamakas. These gamakas act as ornamentations that describe the contours of a rāga. It is widely accepted that there are ten types of gamakas [45, Page 152]. A svara in Carnatic music is not a single point of frequency, although it is referred to with a definite pitch value. It is perceived as movements within a range of pitch values around a mean. Figure 1.4 compares the histograms of pitch values in a melody of rāga Sankarābharanam with its Hindustani equivalent (bilāval) and Western classical counterpart (the major scale). We can see that the pitch histogram is continuous for Carnatic music and Hindustani music but is almost discrete for Western music. It is clearly seen that the svaras are a range of pitch values in Indian classical music, and this range is maximum for Carnatic music.

Figure 1.4: Pitch histogram of rāga Sankarābharanam with its Hindustani and Western classical equivalents (image courtesy Shrey Dutta)

The effect of gamakas on the note positions is illustrated in Figure 1.5. The pitch trajectory of the ārohana (ascending scale) of Sankarābharanam rāga in tonic E is compared with that of the ascending scale of E major, which is its equivalent. We can see that the pitch positions of many svaras of Sankarābharanam move around their

intended pitch values as the result of ornamentations.

Figure 1.5: Effect of gamakas on pitch trajectory

1.4.4 Notes - absolute vs relative frequencies

In Western music, the positions of the notes are absolute. For instance, middle C is fixed at 261.63 Hz. In Carnatic music, the frequencies of the various notes (svaras) are relative to the tonic note (called Sa or shadjam). Hence the svara Sa may be sung at C (261.63 Hz) or G (392 Hz) or at any other frequency as chosen by the performer. The relationship between the notes remains the same in all cases. Hence Ga 1 is always three chromatic steps higher than Sa. Once the key/tonic for the svara Sa is chosen, the frequencies of all the other notes are fully determined. There are also differences in the ratios among the 12 notes between Western music and Indian music, as provided in Table 1.1. In this table, the columns referring to "harmonic" relate to Western music. (This table is courtesy of M V N Murthy, Professor, IMSc.)
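As a minimal sketch of this tonic-relative scheme (an illustration, not from the thesis), the twelve svara frequencies can be derived from any chosen tonic; equal-tempered ratios 2^(n/12) are used here for simplicity, while the just-intonation ratios for Indian music are given in Table 1.1 below.

```python
# Equal-tempered semitone steps above Sa for the 12 svara positions.
SVARA_STEPS = {"S": 0, "R1": 1, "R2": 2, "R3": 3, "G3": 4, "M1": 5,
               "M2": 6, "P": 7, "D1": 8, "D2": 9, "D3": 10, "N3": 11}

def svara_frequencies(tonic_hz: float) -> dict:
    """Frequencies of the 12 svaras for a chosen tonic (Sa)."""
    return {name: tonic_hz * 2.0 ** (step / 12.0)
            for name, step in SVARA_STEPS.items()}

# G3 is four semitone steps above Sa, whatever tonic the performer chooses.
print(round(svara_frequencies(261.63)["G3"], 2))   # Sa at C4 -> ~329.63 Hz
print(round(svara_frequencies(196.00)["G3"], 2))   # Sa at G3 -> ~246.94 Hz
```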

Table 1.1: Differences in frequencies of the 12 notes for Indian Music and Western Music

Note      Ratio (Indian, natural)    Ratio (harmonic, 2^(n/12))
S         1                          1.0000
R1        16/15                      1.0595
R2/G1     9/8                        1.1225
R3/G2     6/5                        1.1892
G3        5/4                        1.2599
M1        4/3                        1.3348
M2        17/12                      1.4142
P         3/2                        1.4983
D1        8/5                        1.5874
D2/N1     5/3                        1.6818
D3/N2     9/5                        1.7818
N3        15/8                       1.8878
(S)       2                          2.0000

1.5 Carnatic Music - The concert setting

A typical Carnatic music concert has a main artiste, who is usually a vocalist, accompanied on the violin, the mrudangam (a percussion instrument) and optionally other percussion instruments. The main artiste chooses a tonic frequency to which the other accompanying artistes tune their instruments. This tonic frequency becomes the concert pitch for that concert. The tonic frequency for male vocal artistes is typically in the range 100-140 Hz, while female vocalists choose a higher tonic.

1.6 Carnatic Music segments

Typically, a Carnatic music concert is 1.5 to 3 hours in duration and is composed of a series of musical items. A musical item in Carnatic music is broadly made up

of two segments: 1) a composition segment and 2) an optional ālāpanā segment which precedes the composition segment. These two segments can be further segmented as below.

1.6.1 Composition

The central part of every item is a song or composition, which is characterised by the participation of all the artistes on the stage. This segment has lyrics (sāhitya) set to a certain melody (rāga) and rhythm (tāla). Typically this segment comprises three sub-segments - pallavi, anupallavi and caranam - although in some cases there can be more segments due to multiple caranam segments. While many artistes render only one caranam segment (even if the composition has multiple caranam segments), some artistes do render multiple caranams or all the caranams. The pallavi part is repeated at the end of the anupallavi and the caranam.

1.6.2 Ālāpanā

The composition can be optionally preceded by an ālāpanā segment. If an ālāpanā is present, the percussion instruments do not participate in it. Only the melodic aspect is expanded and explored, without rhythmic support, by the main artiste supported by the violin artiste. There are no lyrics for the ālāpanā. The main artiste does the ālāpanā, followed by an optional ālāpanā sub-segment by the violin artiste. The above description is depicted in Figures 1.6 and 1.7.

Figure 1.6: Concert Segmentation

Figure 1.7: Item segmentation

1.7 Contribution of the thesis

The following are the main contributions of the thesis.

1. Relevance of MIR for Carnatic music
2. Challenges in MIR for Carnatic music
3. Representation of a musical composition as a time-frequency trajectory
4. Template matching of audio using the t-f representation
5. An information theoretic approach to differentiate between the composition segment, which has percussion, and the melody segment, which does not (see the sketch below)
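Contribution 5 rests on the symmetric Kullback-Leibler (KL2) distance between the feature distributions of adjacent audio windows, detailed in Chapter 3. The following is a minimal sketch of that idea, assuming each window is modelled by a single diagonal-covariance Gaussian (the thesis additionally uses GMM-based boundary verification); window and hop sizes are illustrative.

```python
import numpy as np

def kl2_gaussian(x, y):
    """Symmetric KL divergence between two windows of feature vectors,
    each modelled as a Gaussian with diagonal covariance."""
    mu1, var1 = x.mean(axis=0), x.var(axis=0) + 1e-6
    mu2, var2 = y.mean(axis=0), y.var(axis=0) + 1e-6
    # KL(p||q) for diagonal Gaussians, then symmetrised: KL2 = KL(p||q) + KL(q||p)
    kl_pq = 0.5 * np.sum(np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)
    kl_qp = 0.5 * np.sum(np.log(var1 / var2) + (var2 + (mu1 - mu2) ** 2) / var1 - 1.0)
    return kl_pq + kl_qp

def kl2_curve(features, win=200, hop=50):
    """KL2 between adjacent windows of a (frames x dims) feature matrix;
    peaks are candidate boundaries (e.g. alapana -> kriti)."""
    scores = []
    for start in range(0, len(features) - 2 * win, hop):
        left = features[start:start + win]
        right = features[start + win:start + 2 * win]
        scores.append(kl2_gaussian(left, right))
    return np.array(scores)
```

In Chapter 3, such a KL2 curve is used to propose a boundary, which is then verified with GMMs and smoothed using domain knowledge.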

33 1.8 Organisation of the thesis The organization of the thesis is as follows: Chapter 1 outlined the work done and gives a brief introduction to Carnatic music that will help appreciate this work. In Chapter 2, some of the related work on music segmentation and various commonly used features have been discussed and their suitability to Carnatic music has been studied. Chapter 3 elaborates the approach and results for segmenting an item into ālāpanā and kriti. Chapter 4 elaborates the approach to segment a kriti into pallavi, anupallavi and caranam along with experimental results. Finally, Chapter 5 summarizes the work and discusses possible future work. 19

34 CHAPTER 2 Literature Survey 2.1 Introduction The manner in which humans listen to, interpret and describe music implies that it must contain an identifiable structure. Musical discourse is structured through musical forms such as repetitions and contrasts. The forms of the Western music have been studied in depth by music theorists and codified. Musical forms are used for pedagogical purposes, in composition as in music analysis and some of these forms (such as variations or fugues) are also principles of composition. Musical forms describe how pieces of music are structured. Such forms explain how the sections/ segments work together through repetition, contrast and variations. Repetition brings unity, and variation brings novelty and spark interest. The study of musical forms is fundamental in musical education as among other benefits, the comprehension of musical structures leads to a better knowledge of composition rules, and is the essential first approach for a good interpretation of musical pieces. Every composition in Indian classical music has these forms and are often an important aspect of what one expects when listening to music. The terms used to describe that structure varies according to musical genre. However it is easy for humans to commonly agree upon musical concepts such as melody, beat, rhythm, repetitions etc. The fact that humans are able to distinguish between these concepts implies that the same may be learnt by a machine using

35 signal processing and machine learning. Over the last decade, increase in computing power and advances in music information retrieval have resulted in algorithms which can extract features such as timbre [3], [29], [5], tempo and beats [35], note pitches [26] and chords [32] from polyphonic, mixed source digital music files e.g. mp3 files, as well as other formats. Structural segmentation of compositions directly from audio is a well researched problem in the literature, especially for Western music. Automatic audio segmentation (AAS), is a subfield of Music information retrieval (MIR) that aims at extracting information on the musical structure of songs in terms of segment boundaries, repeating structures and appropriate segment labels. With advancing technology, the explosion of multimedia content in databases, archives and digital libraries has resulted in new challenges in efficient storage, indexing, retrieval and management of this content. Under these circumstances, automatic content analysis and processing of multimedia data becomes more and more important. In fact, content analysis, particularly content understanding and semantic information extraction, have been identified as important steps towards a more efficient manipulation and retrieval of multimedia content. Automatically extracted structural information about songs can be useful in various ways, including facilitating browsing in large digital music collections, music summarisation, creating new features for audio playback devices (skipping to the boundaries of song segments) or as a basis for subsequent MIR tasks. Structural music segmentation consists of dividing a musical piece into several parts or sections and then assigning to those parts identical or distinct labels according to their similarity. The founding principles of structural segmentation are homogeneity, novelty or repetition. 21

Repetition detection is a fundamental requirement for music thumbnailing and music summarisation. These repetitions are also often the chorus part of a popular music piece, which is thematic and musically uplifting. For these MIR tasks, a variety of approaches have been discussed in the past. Previous attempts at music segmentation involved segmenting by spectral shape, segmenting by harmony, and segmenting by pitch and rhythm. While these methods exhibited some amount of success, they generally resulted in over-segmentation (identification of segments at locations where segments do not exist).

In this chapter, under Section 2.2, we summarise some of the approaches attempted by the research community for segmentation and repetition detection tasks. In Section 2.3, we review the various audio features commonly used by the speech and music community. We conclude with our chosen feature and its suitability for Carnatic music.

2.2 Segmentation Techniques

The authors of [39] discuss three fundamental approaches to music segmentation: a) novelty-based, where transitions are detected between contrasting parts; b) homogeneity-based, where sections are identified based on the consistency of their musical properties; and c) repetition-based, where recurring patterns are determined. In the following subsections, we survey the segmentation approaches carried out using machine learning and other approaches (a simplified sketch of the novelty-based idea is given below).
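As a rough illustration of the novelty-based idea (in the spirit of the self-similarity approach of [18], but a simplified sketch rather than any cited implementation), a cosine self-similarity matrix can be combined with a checkerboard kernel slid along its diagonal to produce a novelty curve whose peaks suggest candidate boundaries:

```python
import numpy as np

def novelty_curve(features, kernel_size=32):
    """Novelty via a cosine self-similarity matrix and a checkerboard
    kernel correlated along its main diagonal."""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-9)
    ssm = f @ f.T                                   # (frames x frames) similarity
    half = kernel_size // 2
    sign = np.ones(kernel_size)
    sign[:half] = -1.0
    kernel = np.outer(sign, sign)                   # +1 within-segment, -1 across
    taper = np.exp(-0.5 * np.linspace(-2, 2, kernel_size) ** 2)
    kernel *= np.outer(taper, taper)                # soften the kernel edges
    novelty = np.zeros(len(f))
    for t in range(half, len(f) - half):
        novelty[t] = np.sum(kernel * ssm[t - half:t + half, t - half:t + half])
    return novelty                                  # peaks ~ candidate boundaries
```

The same matrix also serves the other two views: homogeneous sections appear as bright blocks on the diagonal, while repetitions show up as off-diagonal stripes.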

37 2.2.1 Machine Learning based approaches In model-based segmentation approaches used in Machine learning, each audio frame is separately classified to a specific sound class, e.g. speech vs music, vocal vs instrumental, melody vs rhythm etc. In particular, a model is used to represent each sound class. The models for each class of interest are trained using training data. During the testing (operational) phase, a set of new frames is compared against each of the models in order to provide decisions (sound labelling) at the frame-level. Frame labelling is improved using post processing algorithms. Next, adjacent audio frames labelled with the same sound class are merged to construct the detected segments. In the model-based approaches the segmentation process is performed together with the classification of the frames to a set of sound categories. The most commonly used machine learning algorithms in audio segmentation are the Gaussian Mixture Model (GMM), Hidden Markov Model (HMM), Support Vector Machine (SVM) and Artificial Neural Network (ANN). In [4] a 4-state ergodic HMM is trained with all possible transitions to discover different regions in music, based on the presence of steady statistical texture features. The Baum-Welch algorithm is used to train the HMM. Finally, segmentation is deduced by interpreting the results from the Viterbi decoding algorithm for the sequence of feature vectors for the song. In [3], an automatic segmentation approach is proposed that combines SVM classification and audio self-similarity segmentation. This approach firstly separates the sung clips and accompaniment clips from pop music by using SVM preliminary classification. Next, heuristic rules are used to filter and merge the classification result to determine potential segment boundaries further. And fi- 23

38 nally, a self similarity detecting algorithm is introduced to refine segmentation results in the vicinity of potential points. In [31], HMM is used as one of the methods to discover song structure. Here the song is first parameterised using MFCC features. Then these features are used to discover the song structure either by clustering fixed-length segments or by an HMM. Finally using this structure, heuristics are used to choose the key phrase. In [16], techniques such as Wolff-Gibbs algorithm, HMM and prior distribution are used to segment an audio. In [38], a fitness function for the sectional form descriptions is used to select the one with the highest match with the acoustic properties of the input piece. The features are used to estimate the probability that two segments in the description are repeats of each other, and the probabilities are used to determine the total fitness of the description. Since creating the candidate descriptions is a combinatorial problem a novel greedy algorithm constructing descriptions gradually is proposed to solve it. In [1], the audio frames are first classified based on their audio properties, and then agglomerated to find the homogeneous or self-similar segments. The classification problem is addressed using an unsupervised Bayesian clustering model, the parameters of which are estimated using a variant of the EM algorithm. This is followed by beat tracking, and merging of adjacent frames that might belong to the same segment. In [43], segmentation of a full length Carnatic music concert into individual items using applause as a boundary is attempted. Applauses are identified for a concert using spectral domain features. GMMs are built for vocal solo, violin solo and composition ensemble. Audio segments between a pair of the applauses 24

39 are labeled as vocal solo, violin solo, composition ensemble etc. The composition segments are located and the pitch histograms are calculated for the composition segments. Based on similarity measure the composition segment is labelled as inter-item or intra-item. Based on the inter-item locations, intra-item segments are merged into the corresponding items In [48], a Convolutional Neural Networks (CNN) is trained directly on melscaled magnitude spectrograms. The CNN is trained as a binary classifier on spectrogram excerpts, and it includes a larger input context and respects the higher inaccuracy and scarcity of segment boundary annotations. The author(s) of [23] use CNN with spectrograms and self-similarity lag matrices as audio features, thereby capturing more facets of the underlying structural information. A late time-synchronous fusion of the input features is performed in the last convolutional layer, which yielded the best results Non machine learning approaches Non machine learning approaches have primarily used time frequency features or distance measures to identify segment boundaries. Distance based audio segmentation algorithms estimate segments in the audio waveform, which correspond to specific acoustic categories, without labelling the segments with acoustic classes. The chosen audio is blocked into frames, parametrised, and a metric based on distance is applied to feature vectors that are adjacent thus estimating what is called a distance curve. The frame boundaries correspond to peaks of the distance curve where the distance is maximized. These are positions with high acoustic change, and hence are considered as candidate 25

40 audio segment boundaries. Post-processing is done on the candidate boundaries for the purpose of selecting which of the peaks on the distance curve will be identified as audio segment boundaries. The sequence of segments will not be classified to a specific audio sound category at this stage. The categorization is usually performed by a machine learning based classifier as the next stage. Foote was the first to use a auto-correlation matrix where a song s frames are matched against themselves. The author(s) of [18] describe methods for automatically locating points of significant change in music or audio, by analysing local self-similarity. This approach uses the signal to model itself, and thus does not rely on particular acoustic cues nor requires training. This approach was further enhanced in [6], where a self similarity matrix followed by dynamic time warping (DTW) was used to find segment transitions and repetitions. In [51] unsupervised audio segmentation using Bayesian Information Criterion is used. After identifying the candidate segments using Euclidean distance, delta-bic integrating energy-based silence detection is employed to perform the segmentation decision to pick the final acoustic changes. In [52], anchor speaker segments are identified using Bayesian Information Criterion to construct a summary of broadcast news. In [7], three divide-and-conquer approaches for Bayesian information criterion based speaker segmentation are proposed. The approaches detect speaker changes by recursively partitioning a large analysis window into two sub-windows and recursively verifying the merging of two adjacent audio segments using Delta BIC. In [9], a two pass approach is used for speaker segmentation. In the first pass, 26

41 GLR distance is used to detect potential speaker changes, and in second pass, BIC is used to validate these potential speaker changes. In [8], the authors describe a system that uses agglomerative clustering in music structure analysis of a small set of Jazz and Classical pieces. Pitch, which is used as the feature, is extracted and the notes are identified from the pitch. Using the sequence of notes, the melodic fragments that repeat are identified using a similarity measure. Then clusters are formed from pairs of similar phrases and used to describe the music in terms of structural relationships. In [17], the authors propose a dissimilarity matrix containing a measure of dissimilarity for all pairs of feature tuples using MFCC features. The acoustic similarity between any two instants of an audio recording is calculated and displayed as a two-dimensional representation. Similar or repeating elements are visually distinct, allowing identification of structural and rhythmic characteristics. Visualization examples are presented for orchestral, jazz, and popular music. In [21], a feature-space representation of the signal is generated; then, sequences of feature-space samples are aggregated into clusters corresponding to distinct signal regions. The clustering of feature sets is improved via linear discriminant analysis; dynamic programming is used to derive optimal cluster boundaries. In [22], the authors describe a system called RefraiD that locates repeating structural segments of a song, namely chorus segments and estimates both ends of each section. It can also detect modulated chorus sections by introducing a perceptually motivated acoustic feature and a similarity that enable detection of a repeated chorus section even after modulation. Chorus extraction is done in four stages - computation of acoustic features and similarity measures, repetition judgement criterion, estimating end-points of repeated sections and detecting 27

42 modulated repetitions In [33], the structure analysis problem is formulated in the context of spectral graph theory. By combining local consistency cues with long-term repetition encodings and analyzing the eigenvectors of the resulting graph Laplacian, a compact representation is produced that effectively encodes repetition structure at multiple levels of granularity. In [46], the authors describe a novel application of the symmetric Kullback- Leibler distance metric that is used as a solution for segmentation where the goalis to produce a sequence of discrete utterances with particular characteristics remaining constant even when speaker and the channel change independently. In [34], a supervised learning scheme using ordinal linear discriminant analysis and constrained clustering is used. To facilitate abstraction over multiple training examples, a latent structural repetition feature is developed, which summarizes the repetitive structure of a song of any length in a fixed-dimensional representation. 2.3 Audio Features In machine learning, choosing a feature, which is an individual measurable property of a phenomenon is critical. Extracting or selecting features is both an art and science as it requires experimentation of multiple possible features combined with domain knowledge. Features are usually numeric and represented by feature vectors. Perception of music is based on the temporal, spectral and spectro-temporal features. For our work, we could broadly divide the audio features into the following groups : Temporal 28

2.3.1 Temporal Features

Speech and vocal music are produced from a time varying vocal tract system with time varying excitation. For musical instruments, the audio production model is different from that of vocal music; still, the system and the excitation are time variant. As a result, speech and music signals are non-stationary in nature. Most signal processing approaches assume a time invariant system and time invariant excitation, i.e. a stationary signal. Hence these approaches are not directly applicable for speech and music processing. While the speech signal can be considered stationary when viewed in blocks of 10-30 msec windows, the music signal can be considered stationary when viewed in blocks of 50-100 msec windows. Some of the short term parameters are discussed here.

Short-Time Energy (STE): The short-time energy of an audio signal is defined as

$$E_n = \sum_{m=-\infty}^{\infty} (x[m])^2 \, w[n-m] \qquad (2.1)$$

where $w[n]$ is a window function. Normally, a Hamming window is used.

RMS: The root mean square of the waveform is calculated in the time domain to indicate its loudness. It is a measure of amplitude in one analysis window and is defined as

$$\mathrm{RMS} = \sqrt{\frac{x_1^2 + x_2^2 + \cdots + x_n^2}{n}}$$

where $n$ is the number of samples within an analysis window and $x$ is the value of the sample.

Zero-Crossing Rate (ZCR): It is defined as the rate at which the signal crosses zero. It is a simple measure of the frequency content of an audio signal.

Zero crossings are also useful to detect the amount of noise in a signal. The ZCR is defined as

$$Z_n = \sum_{m=-\infty}^{\infty} \left| \mathrm{sgn}(x[m]) - \mathrm{sgn}(x[m-1]) \right| \, w[n-m] \qquad (2.2)$$

where

$$\mathrm{sgn}(x[n]) = \begin{cases} 1, & x[n] \geq 0 \\ -1, & x[n] < 0 \end{cases}$$

and $x[n]$ is a discrete time audio signal, $\mathrm{sgn}(x[n])$ is the signum function and $w[n]$ is a window function. ZCR can also be used to distinguish between voiced and unvoiced speech signals, as unvoiced speech segments normally have much higher ZCR values than voiced segments.

Pitch: Pitch is an auditory sensation in which a listener assigns musical tones to relative positions on a musical scale based primarily on the frequency of vibration. Pitch, often used interchangeably with fundamental frequency, provides important information about an audio signal that can be used for different tasks including music segmentation [25], speaker recognition [2] and speech analysis and synthesis [47]. Generally, audio signals are analysed in the time domain and the spectral domain to characterise a signal in terms of frequency, amplitude, energy etc. But there are some audio characteristics, such as pitch, which are missing from spectra and which are useful for characterising a music signal. Spectral characteristics of a signal can be affected by channel variations, whereas pitch is unaffected by such variations. There are different ways to estimate the pitch of an audio signal, as explained in [2].

Autocorrelation: It is the correlation of a signal with a delayed copy of itself as a function of delay. This is achieved by applying different time lags to the sequence and computing the correlation with the given sequence as reference.

2.3.2 Spectral Features

A temporal signal can be transformed into the spectral domain using a suitable spectral transformation, such as the Fourier transform. There are a number of coefficients that can be derived from the Fast Fourier Transform (FFT), such as:

Spectral centroid: It indicates the region with the biggest density of frequency representation in the audio signal. The spectral centroid is commonly associated with the measure of the brightness of a sound. This measure is obtained by evaluating the center of gravity using the Fourier transform's frequency

and magnitude information. The individual centroid of a spectral frame is defined as the average frequency weighted by amplitudes, divided by the sum of the amplitudes:

$$C = \frac{\sum_{k=0}^{N-1} k\, X[k]}{\sum_{k=0}^{N-1} X[k]}$$

where $X[k]$ is the magnitude of the FFT at frequency bin $k$ and $N$ is the number of frequency bins. Using this feature, in [1], a sound stream is segmented by classifying each sub-segment into silence, pure speech, music, environmental sound, speech over music, and speech over environmental sound classes in multiple steps.

Spectral Flatness: It is the flatness of the spectrum as represented by the ratio between the geometric and arithmetic means. Its output can be seen as a measure of the tonality/noisiness of the sound. A high value indicates a flat spectrum, typical of noise-like sounds or ensemble sections. On the other hand, harmonic sounds produce low flatness values, an indicator for solo phrases.

$$SF_n = \frac{\left( \prod_{k} X_n[k] \right)^{1/K}}{\frac{1}{K} \sum_{k} X_n[k]}$$

where $k$ is the frequency bin index of the magnitude spectrum $X$ at frame $n$ and $K$ is the number of bins. In [24] and [19], a method that utilizes a spectral flatness based tonality feature for segmentation and content based retrieval of audio is outlined.

Spectral flux: Spectral flux is a measure of how quickly the power spectrum of a signal is changing, calculated by comparing the power spectrum of one frame against the power spectrum of the previous frame. More precisely, it takes the Euclidean norm between the two spectra, each one normalized by its energy. It is defined as the 2-norm of two adjacent frames:

$$SF[n] = \int_{\omega} \left( |X_n(e^{j\omega})| - |X_{n+1}(e^{j\omega})| \right)^2 d\omega \qquad (2.3)$$

where $X_n(e^{j\omega})$ is the Fourier transform of the n-th frame of the input signal and is defined as

$$X_n(e^{j\omega}) = \sum_{m=-\infty}^{\infty} w[n-m]\, x[m]\, e^{-j\omega m} \qquad (2.4)$$

In [53], spectral flux is one of the features used to segment an audio stream on the basis of its content into four main audio types: pure speech, music, environment sound, and silence.

Spectral Crest: The shape of the spectrum is described by this feature. It is a measure of the peakiness of a spectrum and is inversely proportional

to the spectral flatness. It is used to distinguish between sounds that are noise-like and tone-like. Noise-like spectra will have a spectral crest near 1. It is calculated by the formula

SC_n = \frac{\max_k |X_n[k]|}{\frac{1}{K} \sum_{k} |X_n[k]|}

In [19], spectral crest is used as one of the features to detect solo phrases in music.

Spectral roll-off: It determines a threshold below which the biggest part of the signal energy resides. The roll-off is a measure of spectral shape. The spectral roll-off point is defined as the N-th percentile of the power spectral distribution, where N is usually 85%. The roll-off point is the frequency below which N% of the magnitude distribution is concentrated. In [27], a modified spectral roll-off is used to segment between speech and music.

Spectral skewness: This is a statistical measure of the asymmetry of the probability distribution of the audio signal spectrum. It indicates whether or not the spectrum is skewed towards a particular range of values.

Spectral slope: It characterises the loss of the signal's energy at higher frequencies. It is a measure of how quickly the spectrum of an audio sound tails off towards the high frequencies, calculated using a linear regression on the amplitude spectrum.

Spectral entropy: It is a measure of the randomness of a system. It is calculated as below:
- Calculate the spectrum X[k] of the signal.
- Calculate the power spectral density (PSD) of the signal by squaring its amplitude and normalizing by the number of bins.
- Normalize the calculated PSD so that it can be viewed as a probability density function (its integral is equal to 1).
- The power spectral entropy can then be calculated using the standard formula for entropy:

PSE = - \sum_{i=1}^{n} p_i \ln p_i

where p_i is the normalised PSD.
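As a minimal illustration of the spectral descriptors above, the sketch below computes the centroid, flatness, crest and entropy of a single frame, and the flux between two adjacent magnitude spectra. The Hanning window and the small eps used to avoid division by zero and log(0) are assumptions of this sketch.

```python
import numpy as np

def spectral_descriptors(frame, eps=1e-12):
    """Spectral centroid, flatness, crest and entropy of one windowed frame."""
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    k = np.arange(len(mag))

    centroid = np.sum(k * mag) / (np.sum(mag) + eps)                          # centre of gravity (bins)
    flatness = np.exp(np.mean(np.log(mag + eps))) / (np.mean(mag) + eps)      # geometric / arithmetic mean
    crest = np.max(mag) / (np.mean(mag) + eps)                                # peakiness of the spectrum

    psd = mag ** 2
    p = psd / (np.sum(psd) + eps)                                             # normalised PSD
    entropy = -np.sum(p * np.log(p + eps))                                    # spectral entropy
    return centroid, flatness, crest, entropy

def spectral_flux(mag_prev, mag_curr):
    """Squared difference between energy-normalised adjacent magnitude spectra."""
    a = mag_prev / (np.linalg.norm(mag_prev) + 1e-12)
    b = mag_curr / (np.linalg.norm(mag_curr) + 1e-12)
    return np.sum((a - b) ** 2)
```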

2.3.3 Cepstral Features

Cepstral analysis originated from speech processing. Speech is composed of two components: the glottal excitation source and the vocal tract system. These two components have to be separated from the speech in order to analyse and model them independently. The objective of cepstral analysis is to separate the speech into its source and system components without any a priori knowledge about the source and/or the system. Because these two component signals are convolved, they cannot be easily separated in the time domain. The cepstrum c is defined as the inverse DFT of the log magnitude of the DFT of the signal x:

c[n] = F^{-1} \{ \log | F \{ x[n] \} | \}

where F is the DFT and F^{-1} is the IDFT. Cepstral analysis measures the rate of change across frequency bands. The cepstral coefficients are a very compact representation of the spectral envelope. They are also (to a large extent) uncorrelated. The glottal excitation is captured by the coefficients where n is high, and the vocal tract response by those where n is low. For these reasons, cepstral coefficients are widely used in speech recognition, generally combined with a perceptual auditory scale.
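A minimal sketch of the real cepstrum as defined above is given below; the small offset added before the logarithm is an assumption to avoid log(0).

```python
import numpy as np

def real_cepstrum(x):
    """Real cepstrum: inverse DFT of the log magnitude of the DFT of x."""
    spectrum = np.fft.fft(x)
    log_mag = np.log(np.abs(spectrum) + 1e-12)   # small offset avoids log(0)
    return np.real(np.fft.ifft(log_mag))

# Low quefrency coefficients approximate the spectral envelope (vocal tract),
# while a strong peak at higher quefrency reflects the periodic (pitch) excitation.
```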

We discuss some types of cepstral coefficients used in speech and music analysis fields:

Linear prediction cepstral coefficients (LPCC): For finding the source (glottal excitation) and system (vocal tract) components from the time domain itself, linear prediction analysis was proposed by Gunnar Fant [15] as a linear model of speech production in which the glottis and vocal tract are fully decoupled. Linear prediction calculates a set of coefficients which provide an estimate, or a prediction, for a forthcoming output sample. The commonest form of linear prediction used in signal processing is one where the output estimate is made entirely on the basis of previous output samples. The result of LP analysis is a set of coefficients a[1..p] and an error signal e[n]; the error signal will be as small as possible and represents the difference between the predicted signal and the original. According to the model, the speech signal is the output y[n] of an all-pole filter 1/A_p(z) excited by x[n]. The filter 1/A_p(z) is known as the synthesis filter. This implicitly introduces the concept of linear predictability, which gives the name to the model. Using this model, the speech signal can be expressed as

y[n] = \sum_{k=1}^{p} a_k \, y[n-k] + e[n]

which states that the speech sample can be modelled as a weighted sum of the p previous samples plus some excitation contribution. In linear prediction, the term e[n] is usually referred to as the error (or residual). The LP parameters {a_i} are estimated such that the error is minimised. The techniques used for this are the covariance method and the autocorrelation method.

The LP coefficients are too sensitive to numerical precision. A very small error can distort the whole spectrum, or make the prediction filter unstable. So it is often desirable to transform LP coefficients into cepstral coefficients. LPCC are linear prediction coefficients (LPC) represented in the cepstrum domain. The cepstral coefficients of LPCC are derived as below:

c(n) = \begin{cases} 0, & n < 0 \\ \ln(G), & n = 0 \\ a_n + \sum_{k=1}^{n-1} \frac{k}{n} \, c(k) \, a_{n-k}, & 0 < n \le p \\ \sum_{k=n-p}^{n-1} \frac{k}{n} \, c(k) \, a_{n-k}, & n > p \end{cases}

where G is the gain of the LP model. Though LP coefficients and LPCC are widely used in speech analysis and synthesis tasks, they are not directly used for audio segmentation. However, a related feature called line spectral frequencies (LSF) has been used for audio segmentation. LSFs are an alternative to the direct-form linear predictor coefficients. They are an alternate parametrisation of the filter with a one-to-one correspondence with the direct-form predictor coefficients. They are not very sensitive to quantization noise and are also stable. Hence they are widely used for quantizing LP filters. In [11], LSFs are used as the core feature for speech-music segmentation. In addition to this, a new feature, the linear prediction zero-crossing ratio (LP-ZCR), is also used, which is defined as the ratio of the zero crossing count of the input and the zero crossing count of the output of the LP analysis filter.
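The following sketch illustrates the autocorrelation method of LP analysis and the LPC-to-cepstrum recursion given above. The predictor order and the number of cepstral coefficients are left as parameters, and the direct linear solve with a small ridge stands in for the usual Levinson-Durbin recursion.

```python
import numpy as np

def lpc_autocorrelation(x, order):
    """LP coefficients a[1..p] via the autocorrelation (Yule-Walker) method."""
    r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R += 1e-9 * np.eye(order)                     # tiny ridge for numerical safety
    a = np.linalg.solve(R, r[1:order + 1])        # predictor coefficients
    gain = r[0] - np.dot(a, r[1:order + 1])       # prediction error power
    return a, gain

def lpcc_from_lpc(a, gain, n_ceps):
    """LPC-to-cepstrum recursion (c[0] = ln G)."""
    p = len(a)
    c = np.zeros(n_ceps)
    c[0] = np.log(max(gain, 1e-12))
    for n in range(1, n_ceps):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c
```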

Mel-frequency cepstrum coefficients (MFCC): The motivation for using mel-frequency cepstrum coefficients is the fact that the auditory response of the human ear resolves frequencies non-linearly. MFCC was first proposed in [36]. The mapping from linear frequency to mel frequency is defined as

f_{mel} = 2595 \log_{10} \left( 1 + \frac{f}{700} \right)

The steps involved in extracting MFCC features are shown in Fig. 2.1.

Figure 2.1: Block diagram of MFCC extraction

Bark frequency cepstral coefficients (BFCC): The Bark scale, another perceptual scale, divides the audible spectrum into 24 critical bands that try to mimic the frequency response of the human ear. Critical bands refer to frequency ranges corresponding to regions of the basilar membrane that are excited when stimulated by specific frequencies. Critical band boundaries are not fixed according to frequency, but depend upon the specific stimuli. Relative bandwidths are more stable, and repeated experiments have found consistent results. In frequency, these widths remain more or less constant at 100 Hz for centre frequencies up to 500 Hz, and are proportional to higher centre frequencies by a factor of 0.2. The relation between the frequency scale and the Bark scale is as below:

f_{Bark} = 6 \ln \left( \frac{f}{600} + \sqrt{ \left( \frac{f}{600} \right)^2 + 1 } \right)

In [37], BFCC is used for real-time instrumental sound segmentation and labeling.
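The sketch below illustrates the two perceptual frequency mappings given above and a single-frame MFCC computation built on triangular filters spaced uniformly on the mel scale. The filter count, coefficient count and window choice are illustrative assumptions, not the settings used in this thesis.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def hz_to_bark(f):
    # Equivalent to 6 ln(f/600 + sqrt((f/600)^2 + 1))
    return 6.0 * np.arcsinh(f / 600.0)

def mfcc_frame(frame, fs, n_filters=26, n_ceps=13, fmin=0.0, fmax=None):
    """MFCCs of one frame: power spectrum -> mel filterbank -> log -> DCT."""
    fmax = fmax or fs / 2
    nfft = len(frame)
    power = np.abs(np.fft.rfft(frame * np.hamming(nfft))) ** 2

    # Triangular filters equally spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, len(power)))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    log_energies = np.log(fbank @ power + 1e-12)
    return dct(log_energies, type=2, norm='ortho')[:n_ceps]
```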

Harmonic cepstral coefficients (HCC): In the MFCC approach, the spectrum envelope is computed from the energy averaged over each mel-scaled filter. This may not work well for voiced sounds with quasi-periodic features, as the formant frequencies tend to be biased toward pitch harmonics, and the formant bandwidth may be mis-estimated. To overcome this shortcoming, instead of averaging the energy within each filter, which results in a smoothed spectrum in MFCC, harmonic cepstral coefficients (HCC) are derived from the spectrum envelope sampled at pitch harmonic locations. This requires robust pitch estimation and voiced/unvoiced/transition (V/UV/T) classification to be performed. This is accomplished using spectro-temporal auto-correlation (STA) followed by a peak-picking algorithm. The block diagram of HCC analysis is shown in Fig. 2.2.

Figure 2.2: Block diagram of HCC analysis

2.3.4 Distance based Features

Distance-based methods perform an analysis over a stream of data to find the point that marks a change in the characteristics of interest. Many distance functions have been proposed in the audio segmentation literature, mainly because they can be blind to the audio stream characteristics, i.e. the type of audio (recording conditions, number of acoustic sources, etc.) or the type of the upcoming audio classes (speech, music, etc.). The most commonly used are:

The Euclidean distance: This is the simplest distance metric for comparing two windows of feature vectors. For the distance between two distributions, we take the distance between only the means of the two distributions. For two windows of audio data described as Gaussian models G_1(\mu_1, \Sigma_1) and G_2(\mu_2, \Sigma_2), the Euclidean distance metric is given by:

d_E = (\mu_1 - \mu_2)^T (\mu_1 - \mu_2)

The Bayesian Information Criterion (BIC): The Bayesian information criterion aims to find the best models that describe a set of data. From the two given windows of the audio stream, the algorithm computes three models representing the windows separately and jointly. From each model, the formula extracts the likelihood and a complexity term that expresses the number of model parameters. For two windows of audio data described as Gaussian models G_1(\mu_1, \Sigma_1) and G_2(\mu_2, \Sigma_2), and with their combined windows described as G(\mu, \Sigma), the BIC distance metric is evaluated as below:

\Delta BIC = BIC(G_1) + BIC(G_2) - BIC(G)

BIC(G) = -\frac{N}{2} \log |\Sigma| - \frac{dN}{2} \log 2\pi - \frac{\lambda}{2} \left( d + \frac{d(d+1)}{2} \right) \log N

\Delta BIC = \frac{N}{2} \log |\Sigma| - \frac{N_1}{2} \log |\Sigma_1| - \frac{N_2}{2} \log |\Sigma_2| - \left( \frac{\lambda d}{2} + \frac{\lambda d (d+1)}{4} \right) \left( \log N_1 + \log N_2 - \log N \right)

where N, N_1, N_2 are the number of frames in the corresponding streams, d is the dimension of the feature vectors and \lambda is an experimentally determined factor. In [5], BIC is used to detect acoustic change due to speaker change, which in turn is used for segmentation based on speaker change.

The Generalized Likelihood Ratio (GLR): When we process music, context is very important. We therefore like to understand the trajectory of features as a function of time. The GLR is a simplification of the Bayesian information criterion. Like BIC, it finds the difference between two windows of the audio stream using the three Gaussian models that describe these windows separately and jointly. For two windows of audio data described as Gaussian models G_1(\mu_1, \Sigma_1) and G_2(\mu_2, \Sigma_2), the GLR distance is given by:

GLR = w \left( 2 \log |\Sigma| - \log |\Sigma_1| - \log |\Sigma_2| \right)

where w is the window size and \Sigma is the covariance of the combined window. In [49], segmenting an audio stream into homogeneous regions according to speaker identities, background noise, music, environmental and channel conditions is proposed using GLR.
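A sketch of a delta-BIC computation between two windows of feature vectors is given below. It uses the common single-penalty variant, (λ/2)(d + d(d+1)/2) log N, rather than the per-window penalty written above, and the covariance regularisation is an assumption of the sketch.

```python
import numpy as np

def delta_bic(X1, X2, lam=1.0):
    """Delta-BIC between two windows of feature vectors (rows = frames).
    A positive value suggests the windows are better modelled separately,
    i.e. a likely change point at their junction."""
    X = np.vstack([X1, X2])
    n1, n2, n = len(X1), len(X2), len(X1) + len(X2)
    d = X.shape[1]

    def logdet(Y):
        cov = np.cov(Y, rowvar=False) + 1e-6 * np.eye(d)   # regularised covariance
        return np.linalg.slogdet(cov)[1]

    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return 0.5 * (n * logdet(X) - n1 * logdet(X1) - n2 * logdet(X2)) - penalty

# Example: two windows drawn from Gaussians with different means
rng = np.random.default_rng(0)
A = rng.normal(0.0, 1.0, size=(200, 5))
B = rng.normal(2.0, 1.0, size=(200, 5))
print(delta_bic(A, B), delta_bic(A, rng.normal(0.0, 1.0, size=(200, 5))))
```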

KL2: Distance metric based segmentation is a popular technique. It relies on the computation of a distance between two acoustic segments to determine whether they have similar timbre or not. A change in timbre is an indicator of a change in acoustic characteristics such as speaker, musical instrument, background ambience etc. The KL divergence is an information-theoretic, likelihood-based, non-symmetric measure that gives the difference between two probability distributions P and Q. The larger this value, the greater the difference between these PDFs. It is given by:

D_{KL}(P \| Q) = \sum_{i} P(i) \log \frac{P(i)}{Q(i)}    (2.5)

As mentioned in [46], since the D_{KL}(P \| Q) measure is not symmetric, it cannot be used as a distance metric. Hence its variation, the KL2 metric, is used here for distance computation. It is defined as follows:

D_{KL2}(P, Q) = D_{KL}(P \| Q) + D_{KL}(Q \| P)    (2.6)

A Gaussian distribution computed on a window of the Fourier-transformed, cent-normalised spectrum is considered as a probability density function. The KL2 distance is computed between adjacent frames to determine the divergence between two adjacent spectra. In [46], the KL2 distance is used to detect segment boundaries where speaker change or channel change occurs.

The Hotelling T^2 statistic: This is another popular tool for comparing distributions. The main difference with KL2 is the assumption that the two compared windows of the audio stream have no difference in their covariances. For two windows of audio data described as Gaussian models G_1(\mu_1, \Sigma_1) and G_2(\mu_2, \Sigma_2), the Hotelling T^2 distance metric is given by:

T^2 = \frac{N_1 N_2}{N_1 + N_2} (\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2)

where \Sigma is the common covariance (\Sigma = \Sigma_1 = \Sigma_2) and N_1, N_2 are the number of frames in the corresponding streams. In [54], the Hotelling T^2 statistic is used to pre-select candidate segmentation boundaries, followed by BIC to perform the segmentation decision.
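For Gaussian windows, both directions of the KL divergence have a closed form, so the KL2 distance in Eq. 2.6 can be computed directly from the window statistics, as the sketch below illustrates; the covariance regularisation in window_stats is an assumption.

```python
import numpy as np

def kl_gauss(mu1, cov1, mu2, cov2):
    """KL divergence D(N1 || N2) between two multivariate Gaussians."""
    d = len(mu1)
    inv2 = np.linalg.inv(cov2)
    diff = mu2 - mu1
    return 0.5 * (np.trace(inv2 @ cov1) + diff @ inv2 @ diff - d
                  + np.log(np.linalg.det(cov2) / np.linalg.det(cov1)))

def kl2_gauss(mu1, cov1, mu2, cov2):
    """Symmetric KL2 distance used for change detection."""
    return kl_gauss(mu1, cov1, mu2, cov2) + kl_gauss(mu2, cov2, mu1, cov1)

def window_stats(X):
    """Mean and (regularised) covariance of a window of feature vectors."""
    return X.mean(axis=0), np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
```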

2.4 Discussions

While these techniques have been attempted for Western music, where the repetitions have more or less static time-frequency melodic content, finding repetitions in improvisational music is a difficult task. In Indian music, the melody content of the repetitions varies significantly (Fig. 1.3) during repetitions within the same composition due to the improvisations performed by the musician. A musician's rendering of a composition is considered rich if (s)he is able to improvise and produce a large number of melodic variants of the line while preserving the grammar, rhythmic structure and the identity of the composition. Another issue that needs to be addressed is that of the tonic. The same composition, when rendered by different musicians, can be sung in different tonics. Hence matching a repeating pattern of a composition across recordings of various musicians requires a tonic-independent approach.

The task of segmenting an item into ālāpanā and kriti in Carnatic music involves differentiating between the textures of the music during the ālāpanā and the kriti. While the kriti segment involves both melody and rhythm, and hence includes the participation of percussion instruments, the ālāpanā segment involves only melody, contributed by the lead performer and the accompanying violinist.

It has been well established in [43] that MFCC features are not suitable for modelling music analysis tasks where there is a dependency on the tonic. When MFCCs are used to model music, a common frequency range is used for all musicians, which does not give the best results when variation in tonic is factored in. With machine learning techniques, when MFCC features are used, training and testing datasets should have the same tonic. This creates problems when music is

compared across tonics, as the tonic can vary from concert to concert and musician to musician. To address the issue of tonic dependency, a new feature called cent filterbank (CFB) energies was introduced in [43]. Hence, modelling of Carnatic music using cent filter-bank (CFB) based features that are normalised with respect to the tonic of the performance, namely CFB energy and cent filterbank cepstral coefficients (CFCC), is the preferred approach for this thesis.

Figure 2.3: Filter-banks and filter-bank energies of a melody segment in the mel scale and the cent scale with different tonic values

2.4.1 CFB Energy Feature

The cent is a logarithmic unit of measure used for musical intervals. Twelve-tone equal temperament divides the octave into 12 semitones of 100 cents each.

An octave (two notes that have a frequency ratio of 2:1) spans twelve semitones and therefore 1200 cents. As mentioned earlier, the notes that make up a melody in Carnatic music are defined with respect to the tonic. The tonic chosen for a concert is maintained throughout the concert using an instrument called the tambura (drone). The analysis of a concert therefore should depend on the tonic. The tonic ranges from 180 Hz to 220 Hz for female singers and 100 Hz to 140 Hz for male singers.

Tonic normalisation in CFB removes the spectral variations. This is illustrated in Fig. 2.3, which shows time filter-bank energy plots for both the mel scale and the cent scale. The time filter-bank energies are shown for the same melody segment as sung by a male and a female musician. Filter-bank energies and filter-banks are plotted for two different musicians (male motif with tonic 134 Hz and female motif with tonic 145 Hz) with different tonic values. In the case of the mel scale, filters are placed across the same frequencies for every concert irrespective of the tonic values, whereas, in the case of the cent scale, the filter-bank frequencies are normalised with respect to the tonic. The male and female motifs are clearly emphasised irrespective of the tonic values in the cent scale, and are not clearly emphasised in the mel scale. (Image courtesy: Padi Sarala)

CFB energy feature extraction is carried out as below:

1. The audio signal is divided into frames.
2. The short-time DFT is computed for each frame.
3. The frequency scale is normalised by the tonic. The cent scale is defined as: Cent = 1200 log_2(f / tonic).
4. Six octaves corresponding to [-1200 : 6000] cents are chosen for every musician. While up to 3 octaves can be covered in a concert, the instruments produce harmonics which are critical to capture the timbre. The choice of six octaves is to capture the rich harmonics involved in musical instruments.
5. The cent normalised power spectrum is then multiplied by a bank of 80 filters that are spaced uniformly in the linear scale to account for the harmonics of pitch. The choice of 80 filters is based on the experiments in [43].
6. The filterbank energies are computed for every frame and used as a feature after removing the bias.

CFB energy features were extracted for every frame of length 100 ms of the musical item, with a shift of 10 ms. Thus, an 80-dimensional feature is obtained for every 10 ms of the item, resulting in N feature vectors for the entire item. A sketch of this pipeline is given below.
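The sketch is a rough, assumption-laden rendering of the steps above. The frame length, hop, filter shape and the placement of the filters uniformly on the cent axis are simplifications (the thesis places the filters uniformly on the linear scale after tonic normalisation), and bias removal is approximated by per-dimension mean subtraction.

```python
import numpy as np

def cent_filterbank_energies(x, fs, tonic, n_filters=80, frame_len=None,
                             hop=None, cent_range=(-1200.0, 6000.0)):
    """Tonic-normalised filterbank energies on the cent scale (a sketch)."""
    frame_len = frame_len or int(0.100 * fs)       # assumed 100 ms frames
    hop = hop or int(0.010 * fs)                   # assumed 10 ms shift
    edges = np.linspace(cent_range[0], cent_range[1], n_filters + 2)

    feats = []
    for start in range(0, len(x) - frame_len, hop):
        frame = x[start:start + frame_len] * np.hanning(frame_len)
        power = np.abs(np.fft.rfft(frame)) ** 2
        freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
        cents = 1200.0 * np.log2(np.maximum(freqs, 1e-6) / tonic)

        # Triangular filters placed uniformly on the cent axis (simplification)
        energies = np.zeros(n_filters)
        for i in range(n_filters):
            l, c, r = edges[i], edges[i + 1], edges[i + 2]
            rising = np.clip((cents - l) / (c - l), 0, 1)
            falling = np.clip((r - cents) / (r - c), 0, 1)
            energies[i] = np.sum(power * np.minimum(rising, falling))
        feats.append(np.log(energies + 1e-12))

    feats = np.array(feats)
    return feats - feats.mean(axis=0)              # crude bias removal per dimension
```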

2.4.2 CFB Slope Feature

In Carnatic music, a collective expression of melodies that consists of svaras (ornamented notes) in a well defined order constitutes the phrases (aesthetic threads of ornamented notes) of a rāga. Melodic motifs are those unique phrases of a rāga that collectively give a rāga its identity. In Fig. 4.2, it can be seen that the presence of the strokes due to the mrudangam destroys the melodic motif. To address this issue, a cent filterbank based slope was computed along frequency. Let the vector of log filter bank energy values be represented as F_i = (f_{1,i}, f_{2,i}, ..., f_{n_f,i})^t, where n_f is the number of filters. Mean subtraction on the sequence F_i, where i = 1, 2, ..., n, is applied as before. Here, n is the number of feature vectors in the query. To remove the effect of percussion, slope values across consecutive values in each vector F_i are calculated. Linear regression over 5 consecutive filterbank energies is performed. A vector of slope values s_i = (s_{1,i}, s_{2,i}, ..., s_{n_f - 1, i})^t for each frame of music is obtained as a result.
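A sketch of the slope computation is given below: a least-squares line is fitted over 5 consecutive log filterbank energies and its slope is retained. The exact indexing of the resulting slope vector (here n_f - reg_len + 1 values per frame) is an assumption of the sketch.

```python
import numpy as np

def cfb_slope(log_fbank_energies, reg_len=5):
    """Frequency-direction slope of log filterbank energies.
    A least-squares line is fit over `reg_len` consecutive filters; the slope
    of each local fit is retained, de-emphasising broadband percussion."""
    n_frames, n_filters = log_fbank_energies.shape
    x = np.arange(reg_len, dtype=float)
    x = x - x.mean()
    denom = np.sum(x ** 2)

    n_out = n_filters - reg_len + 1
    slopes = np.zeros((n_frames, n_out))
    for j in range(n_out):
        window = log_fbank_energies[:, j:j + reg_len]      # (n_frames, reg_len)
        centred = window - window.mean(axis=1, keepdims=True)
        slopes[:, j] = centred @ x / denom                 # least-squares slope
    return slopes
```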

2.4.3 CFCC Feature

To arrive at the cent filterbank cepstral coefficient (CFCC) features, after carrying out the steps enumerated in section 2.4.1, a DCT is applied on the filterbank energies to de-correlate them, and the required coefficients are retained.
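A one-line sketch of this step, assuming the filterbank energies are arranged as a frames-by-filters matrix; the number of retained coefficients is illustrative.

```python
import numpy as np
from scipy.fftpack import dct

def cfcc(cent_fbank_energies, n_coeffs=20):
    """Cent filterbank cepstral coefficients: DCT of the filterbank energies,
    keeping the first few coefficients to de-correlate and compress."""
    return dct(cent_fbank_energies, type=2, norm='ortho', axis=1)[:, :n_coeffs]
```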

CHAPTER 3

Identification of ālāpanā and kriti segments

3.1 Introduction

Ālāpanā (Sanskrit: dialogue) is a way of rendition to explore the features and beauty of a rāga. Since the ālāpanā is purely melodic, with no lyrical and rhythmic components, it is best suited to bring out the various facets of a rāga [4, Chapter 4]. The performer brings out the beauty of a rāga using creativity and internalised knowledge about the grammar of the rāga. During the ālāpanā, the performer improvises each note or a set of notes, gradually gliding across octaves, emphasising important notes and motifs, thereby evoking the mood of the rāga. After the main artiste finishes the ālāpanā, the accompanying violinist may optionally perform an ālāpanā in the same rāga.

The kritis are central to any Carnatic music concert. Every kriti is a confluence of three aspects: lyrics, melody and rhythm. Every musical item in a concert will have the mandatory kriti segment and optionally an ālāpanā segment. The syllables of the lyrics of the kriti go hand in hand with the melody of the rāga, thereby enriching the listening experience. The lyrics are also important in Carnatic music. While the rāga evokes certain emotional feelings, the lyrics further accentuate them, adding to the aesthetics and the listening experience.

In this chapter, we will describe an approach to identify the boundary separating the ālāpanā and the kriti using KL2, GMM and the CFB energy feature. In section 3.2, we

will describe our algorithm used for the segmentation. In section 3.3, we will discuss the results of our experiments. We will conclude this chapter with discussions on the results.

3.2 Segmentation Approach

3.2.1 Boundary Detection

In order to detect the boundary separating the ālāpanā and the kriti, individual feature vectors need to be labelled. One naive approach to find the boundary would be to label each and every feature vector. Since each feature vector corresponds to 10 ms, and a musical item can last anywhere between 3 mins and 30 mins, this would require labelling too many feature vectors for the entire musical item. Moreover, there would be small intervals of time during the kriti when percussion content would be absent, either due to inter-stroke silence or due to aesthetic pauses deliberately introduced by the percussionist. So, a better approach would be to extract a segment of feature vectors from the item and try to label the segment as a whole. Hence, finding the boundary between the ālāpanā and the kriti would involve:

- Iterate over the N feature vectors, one at a time.
- Consider a segment of specified length to the left and right of the current feature vector.
- Use a machine learning technique to label these two segments as a whole. This reduces the resolution of the segmentation process to the segment length.
- Use music domain knowledge to correct and agglomerate the labels to find the boundary between the ālāpanā and the kriti.

This approach is computationally intensive. To further improve the efficiency of this process, we have to reduce the search space for the boundary. The following approach using KL2 was used (a sketch of this procedure is given after Fig. 3.1):

- Iterate over the N feature vectors, one at a time.
- Consider a sliding window consisting of a sequence of 500 feature vectors (5 seconds), W_n, where n denotes the starting position, n = 1, 2, ..., N - 500.
- Average the density function obtained earlier over the entire window length.
- Calculate the KL2 distance between two successive windows of music, W_n and W_{n+1}. Larger values of the KL2 distance denote a large change in distribution.
- A threshold was automatically chosen such that there is a 3 second spacing between adjacent peaks of the KL2 values. This is to prevent the algorithm from generating too many change points. The choice of 3 seconds was empirically arrived at, as a trade-off between accuracy and efficiency.
- The peaks extracted correspond to an array of K possible boundaries B = [b_1, b_2, ..., b_K] between the ālāpanā and the kriti.

Figure 3.1: KL2 values and possible segment boundaries (KL2 value vs. feature vector index, with the automatically chosen threshold and the selected peaks marked).

Fig. 3.1 shows the output of the algorithm described above.
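A sketch of this candidate-selection step follows. It compares the windows immediately before and after each position with a diagonal-covariance KL2 (the thesis uses full Gaussians on the cent-normalised spectrum and an adaptive threshold), and enforces the minimum peak spacing through the distance argument of the peak picker; the window and spacing values are illustrative frame counts.

```python
import numpy as np
from scipy.signal import find_peaks

def kl2_diag(X1, X2, eps=1e-6):
    """Symmetric KL distance between two windows, assuming diagonal Gaussians."""
    m1, v1 = X1.mean(0), X1.var(0) + eps
    m2, v2 = X2.mean(0), X2.var(0) + eps
    kl12 = 0.5 * np.sum(v1 / v2 + (m2 - m1) ** 2 / v2 - 1 + np.log(v2 / v1))
    kl21 = 0.5 * np.sum(v2 / v1 + (m1 - m2) ** 2 / v1 - 1 + np.log(v1 / v2))
    return kl12 + kl21

def boundary_candidates(features, win=500, min_gap=300):
    """KL2 between adjacent sliding windows of feature vectors, followed by
    peak picking with a minimum spacing between retained peaks."""
    n = len(features)
    scores = np.zeros(n)
    for i in range(win, n - win):
        scores[i] = kl2_diag(features[i - win:i], features[i:i + win])
    peaks, _ = find_peaks(scores, distance=min_gap)
    return peaks, scores
```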

3.2.2 Boundary verification using GMM

From the K possible boundary values, the actual boundary between the kriti and the ālāpanā needs to be identified. In order to verify the boundaries, GMMs were used. The GMMs were trained using CFB energy features after applying a DCT for compression. GMMs were trained for both the classes, kriti and ālāpanā, using a training dataset, with 32 mixtures per class. The approach is as follows:

- A window of length 1000 feature vectors (10 seconds) was extracted to the left and right of each possible boundary point in B.
- Labels for the left segments, LSL, and the right segments, RSL, were estimated using the GMMs (as shown in Fig. 3.2).

Figure 3.2: GMM Labels

3.2.3 Label smoothing using Domain Knowledge

Now, using the set of possible boundaries (B) and their left and right segment labels (LSL and RSL), we need to assign the label for each individual feature vector, L. The following approach was used to find L:

L[n] = \begin{cases} LSL[1], & 1 \le n \le B[1] \\ RSL[k], & B[k] < n < (B[k] + B[k+1])/2, \; k = 1..K-1 \\ LSL[k], & (B[k-1] + B[k])/2 \le n \le B[k], \; k = 2..K \\ RSL[K], & B[K] < n \le N \end{cases}

The labels after applying the above approach are as shown in Figure 3.3.

Figure 3.3: Entire song label generated using GMM

Domain information was used to improve the results. To agglomerate the labels, a smoothing algorithm was used, as described below (a sketch of the label assignment and smoothing is given after Fig. 3.4):

- An item can have at most 2 segments: ālāpanā and kriti.
- If present, the ālāpanā must be at least 3 seconds long.
- A kriti may be preceded by an ālāpanā, and not vice versa.
- If a smaller segment of a particular label (ālāpanā or kriti) is identified in between two larger segments of a different label, then the smaller segment is relabelled and merged with the adjacent larger segments.

The final song label is as shown in Figure 3.4.

Figure 3.4: Entire song label generated using GMM after smoothing
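A sketch of the label assignment rule and the smoothing step is given below; the minimum run length and the 0/1 label encoding are illustrative assumptions, and the final rule simply forces every frame after the first detected kriti frame to the kriti label.

```python
import numpy as np

def assign_labels(n_frames, boundaries, left_labels, right_labels):
    """Expand per-boundary segment labels into a frame-level label sequence,
    following the piecewise rule above (0 = alapana, 1 = kriti)."""
    L = np.empty(n_frames, dtype=int)
    b = list(boundaries)
    L[:b[0] + 1] = left_labels[0]
    for k in range(len(b) - 1):
        mid = (b[k] + b[k + 1]) // 2
        L[b[k] + 1:mid + 1] = right_labels[k]
        L[mid + 1:b[k + 1] + 1] = left_labels[k + 1]
    L[b[-1] + 1:] = right_labels[-1]
    return L

def smooth_labels(L, min_len=3000):
    """Relabel runs shorter than `min_len` frames (illustrative value) to match
    the preceding run, then force the 'alapana cannot follow kriti' rule."""
    L = L.copy()
    changes = np.flatnonzero(np.diff(L)) + 1
    runs = np.split(np.arange(len(L)), changes)
    for r in runs[1:-1]:
        if len(r) < min_len:
            L[r] = L[r[0] - 1]                 # absorb into the preceding run
    first_kriti = np.argmax(L == 1) if (L == 1).any() else len(L)
    L[first_kriti:] = 1                        # no alapana after the kriti starts
    return L
```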

3.3 Experimental Results

3.3.1 Dataset Used

Experiments were conducted on 40 live concert recordings. Of these 40 concerts, 6 were multi-track recordings and the remaining were single-track recordings. The details of the dataset used are given in Table 3.1. Durations are given in approximate hours (h).

Table 3.1: Division of dataset.

                                 Male    Female    Total
No. of artistes
No. of concerts
No. of items with ālāpanā
No. of items without ālāpanā
Total no. of items
Total duration of kriti          30 h    18 h      48 h
Total duration of ālāpanā        12 h     7 h      19 h

3.3.2 Results

Experiments were performed using both MFCC and CFB based features. Two metrics were used to calculate the accuracy of segmentation: frame-level accuracy and item classification accuracy. As mentioned earlier, a musical item in a concert can be a kriti, optionally preceded by an ālāpanā. Assuming that the ālāpanā-kriti boundary was detected properly, item classification was pursued.

Results using CFB features

Table 3.2 shows the confusion matrix for the frame-level classification using the CFB based feature. Table 3.3 shows the performance for the frame-level classification.

Table 3.2: Confusion matrix: Frame-level labelling

                 kriti          ālāpanā
kriti            1,64,11,759    7,77,084
ālāpanā          13,16,925      56,87,274

Table 3.3: Performance: Frame-level labelling

              kriti    ālāpanā
Precision
Recall
F-measure
Accuracy      0.9134

Table 3.4 shows the confusion matrix for the item classification using the CFB based feature. Table 3.5 shows the corresponding performance for the item classification.

Table 3.4: Confusion matrix: Item Classification

                     Without ālāpanā    With ālāpanā
Without ālāpanā
With ālāpanā

Table 3.5: Performance: Item Classification

              Without ālāpanā    With ālāpanā
Precision
Recall
F-measure
Accuracy      0.8816

Results using MFCC features

Table 3.6 shows the confusion matrix for the frame-level classification using the MFCC feature. Table 3.7 shows the performance for the frame-level classification.

Table 3.6: Confusion matrix: Frame-level labelling

                 kriti          ālāpanā
kriti            1,39,58,342    32,31,221
ālāpanā          52,06,214      17,43,985

Table 3.7: Performance: Frame-level labelling

              kriti    ālāpanā
Precision
Recall
F-measure
Accuracy      0.649

Table 3.8 shows the confusion matrix for the item classification using the MFCC feature. Table 3.9 shows the corresponding performance for the item classification.

Table 3.8: Confusion matrix: Item Classification

                     Without ālāpanā    With ālāpanā
Without ālāpanā
With ālāpanā

Table 3.9: Performance: Item Classification

              Without ālāpanā    With ālāpanā
Precision
Recall
F-measure
Accuracy      0.5732

3.4 Discussions

It can be observed that, using this approach, a frame-level labelling accuracy of 91.34% and an item classification accuracy of 88.16% have been achieved using the CFB

energy feature, whereas using the MFCC feature, a frame-level labelling accuracy of 64.9% and an item classification accuracy of 57.32% have been achieved. The accuracy of MFCC is low due to the common frequency range assumed in the feature extraction process. Also, some of the recordings are not clean, and in some cases the ālāpanā was very short. These have contributed to errors in classification.

CHAPTER 4

Segmentation of a kriti

4.1 Introduction

In Carnatic music, a kriti or composition typically comprises three segments, namely the pallavi, anupallavi and caranam, although in some cases there can be more segments due to multiple caranam segments. While many artistes render only one caranam segment (even if the composition has multiple caranam segments), some artistes do render multiple caranams or all the caranams. The pallavi in a composition in Carnatic music is akin to the chorus or refrain in Western music, albeit with a key difference: the pallavi (or part of it) can be rendered with a number of variations in melody, without any change in the lyrics, and is repeated after each segment of a composition.

Segmentation and detection of repeating chorus phrases in Western music is a well researched problem, and a number of techniques have been proposed to segment a Western music composition. While these techniques have been attempted for Western music, where the repetitions have more or less static time-frequency melodic content, finding repetitions in improvisational music is a difficult task. In Indian music, the melody content of the repetitions varies significantly during repetitions within the same composition due to the improvisations performed by the musician. A musician's rendering of a composition is considered rich if (s)he is able to improvise and produce a large number of melodic variants of the line while preserving the grammar, the identity of the composition and the rāga. Further, the same composition when rendered by

different musicians can be sung in different tonics. Hence matching a repeating pattern of a composition across recordings of various musicians requires a tonic-independent approach.

Segmentation of compositions is important both from the perspective of lyrics and of melody. The pallavi, being the first segment, also plays a major role in presenting a gist of the rāga, which gets further elaborated in the anupallavi and caranam. In the pallavi, a musical theme is initiated with key phrases of the rāga, developed a little further in the anupallavi and further enlarged in the caranam, maintaining a balanced sequence, one built upon the other. A similar stage-by-stage development from the lyrical aspect can also be observed. An idea takes form initially in the pallavi, which is the central lyrical theme, is further emphasised in the anupallavi and is substantiated in the caranam.

Let us illustrate this with an example, a kriti of Saint Tyāgaraja. The central theme of this composition is "why there is a screen between us". The lyrical meaning of this kriti is as below:

Pallavi: Oh Lord, why this screen (between us)?

Anupallavi: Oh lord of moving and non-moving forms, who has the sun and moon as eyes, why this screen?

Caranam: Having searched my inner recess, I have directly perceived that everything is You alone. I shall not even think in my mind of anyone other than You. Therefore, please protect me. Oh Lord, why this screen?

The pallavi or a part of the pallavi is repeated multiple times with improvisation for the following reasons: 1) the central lyrical theme that gets expressed in the pallavi is highlighted by repeating it multiple times, and 2) the melodic aspects of the rāga

and the creative aspects of the artiste (or the music school) jointly get expressed through the repetitions of the pallavi. These improvisations in a given composition also stand out as signatures to identify an artiste or the music school. Since the pallavi serves as a delimiter or separator between the various segments, locating the pallavi repetitions also leads to knowledge of the number of segments in a composition (>= 3) as rendered by a certain performer.

A commonly observed characteristic of the improvisation of the pallavi (or a part of it) is that, for a given composition, a portion (typically half) of the repeating segment will remain more or less constant in melodic content throughout the composition, while the other portion varies from one repetition to another. For instance, if the first half of the repeating segment remains constant in melody, the second half varies during repetitions, and vice versa. This property is used to locate repetitions of the pallavi in spite of variations in melody from one repetition to another.

In this chapter, under section 4.2, we will discuss the algorithm used to segment a kriti. Then, under section 4.3, we will present the results of our experiments. We will conclude with discussions on our findings.

4.2 Segmentation Approach

4.2.1 Overview

The structure of a composition in Carnatic music is such that the pallavi or part of it gets repeated at the end of the anupallavi and caranam segments. Hence our overall approach is to use the pallavi or a part of it as a query, to look for repetitions of the query in the composition, and thereby segment the composition into the pallavi,

anupallavi and caranam. In our initial attempts, the query was first manually extracted from 75 popular Carnatic music compositions. In 65 of these compositions, the lead artiste was a vocalist accompanied by a violin and one or more percussion instruments, while in the remaining 10 compositions, an instrumentalist was the lead artiste accompanied by one or more percussion instruments. The pallavi lines were converted to time-frequency motifs. These motifs were then used to locate the repetitions of this query in the composition. Cent-filterbank based features were used to obtain tonic normalised features.

Although the pallavi line of a composition can be improvised in a number of different ways with variations in melody, the timbral characteristics and some parts of the melodic characteristics of the pallavi query do have a match across repetitions. The composition is set to a specific tala (rhythmic cycle), and the lines of a pallavi must preserve the beat structure. With these as the cues, given the pallavi or a part of it as the query, an attempt was made to segment the composition. The time-frequency motif was represented as a matrix of mean normalised cent filterbank based features. Cent filterbank based energies and slope features were extracted for the query and the entire composition. The correlation coefficients between the query and the composition were obtained while sliding the query window across the composition; the locations of the peaks of correlation indicate the locations of the pallavi (a sketch of this matching procedure is given below). We also attempted to extract the query automatically for all the compositions using the approach described in and cross-checked the query length with the manual approach.
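In the sketch below, the query and each candidate window are mean-normalised and compared with a correlation coefficient; the greedy selection of non-overlapping top-scoring positions is an illustrative stand-in for the peak picking over a threshold used in the thesis.

```python
import numpy as np

def locate_repetitions(query_feats, comp_feats, top_k=10):
    """Slide a query template (frames x filters) over the composition's
    feature matrix and score each position by the correlation coefficient
    between the two mean-normalised patches."""
    q = query_feats - query_feats.mean()
    qn = q / (np.linalg.norm(q) + 1e-12)
    n_q = len(query_feats)
    n_pos = len(comp_feats) - n_q + 1

    scores = np.zeros(n_pos)
    for i in range(n_pos):
        w = comp_feats[i:i + n_q]
        w = w - w.mean()
        scores[i] = np.sum(qn * w) / (np.linalg.norm(w) + 1e-12)

    # Greedily pick the best non-overlapping positions as candidate repetitions
    order = np.argsort(scores)[::-1]
    picked = []
    for i in order:
        if all(abs(i - j) >= n_q for j in picked):
            picked.append(i)
        if len(picked) == top_k:
            break
    return sorted(picked), scores
```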

4.2.2 Time Frequency Templates

The spectrogram is a popular time-frequency representation. The repeated line of a pallavi is a trajectory in the time-frequency plane. Fig. 4.1 shows spectrograms of the query and of matched and unmatched time-frequency segments of the same length in a composition, using linear filterbank energies. One can see some similarity of structure between the query and the matched segments. Such a similarity of structure is absent between the query and the unmatched segments. The frequency range is set appropriately to occupy about 6 octaves for any musician. Although the spectrogram does show some structure, the motifs corresponding to those of the query are not evident. This is primarily because the motif is sung to a specific tonic. Therefore the analysis of a concert also crucially depends on the tonic.

Figure 4.1: Time-frequency template of music segments using the FFT spectrum (X axis: time in frames, Y axis: frequency in Hz), showing the query, matched segments and unmatched segments.

The cent filterbank energies were computed for both the query and the composition. The time-dependent filterbank energies were then used as a query. Fig. 4.2 shows a time-frequency template of the query and some matched and unmatched examples from the composition. A sliding window approach was used to determine the locations of the query in the composition. The locations at which the correlation is maximum correspond to matches with the query. Fig. 4.4 shows a plot of the correlation as a function of time. The locations of the peaks in the correlation, as verified by a musician, correspond to the locations of the repeating query.

Figure 4.2: Time-frequency template of music segments using cent filterbank energies (X axis: time in frames, Y axis: filter), showing the query, matched segments and unmatched segments.

As mentioned earlier in section 2.4.2, percussion strokes destroy the motif of the melody. So cent filterbank slope features were also used as an alternate feature. Fig. 4.3 shows a plot of the time-dependent query based on the filter bank slope and

the corresponding matched and unmatched segments in the composition. One can observe that the motifs are significantly emphasised, while the effect of percussion is almost absent.

Figure 4.3: Time-frequency template of music segments using cent filterbank slope (X axis: time in frames, Y axis: filter), showing the query, matched segments and unmatched segments.

4.3 Experimental Results

The experiments were performed primarily on Carnatic music, though limited experiments were done on other musical genres, namely Hindustani and Western music. For Carnatic music, a database of 75 compositions by various artistes was used. The database comprised compositions rendered by a lead vocalist or a lead instrumentalist, the instruments being the flute, violin and veena. The tonic information was determined for each composition. Cent filterbank based energies and cent

filter bank based slope features were extracted for each of these compositions and used for the experiments. For every 100 millisecond frame of the composition, 80 filters were uniformly placed across 6 octaves (the choice of the number of filters was experimentally arrived at to achieve the required resolution). The correlation between the query and the moving windows of the composition was computed.

4.3.1 Finding Match with a Given Query

The query for each composition was extracted manually and the cent filterbank based features were computed. Then Algorithm 1 was used for both CFB based energy and slope features. Fig. 4.4 and Fig. 4.5 show correlation plots using CFB energy and slope features for the composition janani ninnu vinā. We can see that the identified repeating patterns clearly stand out among the peaks due to higher correlation. The spectrogram of the initial portion of the same composition, with the query and the matching sections, is shown in Fig. 4.6.

Figure 4.4: Correlation as a function of time (cent filterbank energies), with the correlation threshold, ground truth and the pallavi, anupallavi and caranam regions marked.

Figure 4.5: Correlation as a function of time (cent filterbank slope), with the correlation threshold, ground truth and the pallavi, anupallavi and caranam regions marked.

Figure 4.6: Spectrogram of query and matching segments as found out by the algorithm.

The experiments were repeated with MFCC and with chroma features, with and without overlapping filters. For MFCC, 20 coefficients were extracted, with 40 filters placed in the frequency range 0 Hz to 8000 Hz. The chroma filter-banks [12] used for Western classical music use non-overlapping filters, as the scale is equi-tempered and hence is characterised by a unique set of 12 semitones, subsets of which are used in performances. Indian music pitches follow a just intonation rather than an equi-tempered intonation [44]. Even just intonation is not adequate, as shown in [28], because the pitch histograms across all rāgas of Carnatic music appear to be more or less continuous. To account for this, chroma filter-banks with a set of overlapping filters were experimented with, in addition to chroma filter


AUTOREGRESSIVE MFCC MODELS FOR GENRE CLASSIFICATION IMPROVED BY HARMONIC-PERCUSSION SEPARATION AUTOREGRESSIVE MFCC MODELS FOR GENRE CLASSIFICATION IMPROVED BY HARMONIC-PERCUSSION SEPARATION Halfdan Rump, Shigeki Miyabe, Emiru Tsunoo, Nobukata Ono, Shigeki Sagama The University of Tokyo, Graduate

More information

CHAPTER 4 SEGMENTATION AND FEATURE EXTRACTION

CHAPTER 4 SEGMENTATION AND FEATURE EXTRACTION 69 CHAPTER 4 SEGMENTATION AND FEATURE EXTRACTION According to the overall architecture of the system discussed in Chapter 3, we need to carry out pre-processing, segmentation and feature extraction. This

More information

Composer Identification of Digital Audio Modeling Content Specific Features Through Markov Models

Composer Identification of Digital Audio Modeling Content Specific Features Through Markov Models Composer Identification of Digital Audio Modeling Content Specific Features Through Markov Models Aric Bartle (abartle@stanford.edu) December 14, 2012 1 Background The field of composer recognition has

More information

MUSICAL INSTRUMENT IDENTIFICATION BASED ON HARMONIC TEMPORAL TIMBRE FEATURES

MUSICAL INSTRUMENT IDENTIFICATION BASED ON HARMONIC TEMPORAL TIMBRE FEATURES MUSICAL INSTRUMENT IDENTIFICATION BASED ON HARMONIC TEMPORAL TIMBRE FEATURES Jun Wu, Yu Kitano, Stanislaw Andrzej Raczynski, Shigeki Miyabe, Takuya Nishimoto, Nobutaka Ono and Shigeki Sagayama The Graduate

More information

THE importance of music content analysis for musical

THE importance of music content analysis for musical IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 15, NO. 1, JANUARY 2007 333 Drum Sound Recognition for Polyphonic Audio Signals by Adaptation and Matching of Spectrogram Templates With

More information

Voice & Music Pattern Extraction: A Review

Voice & Music Pattern Extraction: A Review Voice & Music Pattern Extraction: A Review 1 Pooja Gautam 1 and B S Kaushik 2 Electronics & Telecommunication Department RCET, Bhilai, Bhilai (C.G.) India pooja0309pari@gmail.com 2 Electrical & Instrumentation

More information

Singer Recognition and Modeling Singer Error

Singer Recognition and Modeling Singer Error Singer Recognition and Modeling Singer Error Johan Ismael Stanford University jismael@stanford.edu Nicholas McGee Stanford University ndmcgee@stanford.edu 1. Abstract We propose a system for recognizing

More information

PRESCOTT UNIFIED SCHOOL DISTRICT District Instructional Guide January 2016

PRESCOTT UNIFIED SCHOOL DISTRICT District Instructional Guide January 2016 Grade Level: 9 12 Subject: Jazz Ensemble Time: School Year as listed Core Text: Time Unit/Topic Standards Assessments 1st Quarter Arrange a melody Creating #2A Select and develop arrangements, sections,

More information

2014 Music Style and Composition GA 3: Aural and written examination

2014 Music Style and Composition GA 3: Aural and written examination 2014 Music Style and Composition GA 3: Aural and written examination GENERAL COMMENTS The 2014 Music Style and Composition examination consisted of two sections, worth a total of 100 marks. Both sections

More information

Topic 10. Multi-pitch Analysis

Topic 10. Multi-pitch Analysis Topic 10 Multi-pitch Analysis What is pitch? Common elements of music are pitch, rhythm, dynamics, and the sonic qualities of timbre and texture. An auditory perceptual attribute in terms of which sounds

More information

Toward Automatic Music Audio Summary Generation from Signal Analysis

Toward Automatic Music Audio Summary Generation from Signal Analysis Toward Automatic Music Audio Summary Generation from Signal Analysis Geoffroy Peeters IRCAM Analysis/Synthesis Team 1, pl. Igor Stravinsky F-7 Paris - France peeters@ircam.fr ABSTRACT This paper deals

More information

Analyzing & Synthesizing Gamakas: a Step Towards Modeling Ragas in Carnatic Music

Analyzing & Synthesizing Gamakas: a Step Towards Modeling Ragas in Carnatic Music Mihir Sarkar Introduction Analyzing & Synthesizing Gamakas: a Step Towards Modeling Ragas in Carnatic Music If we are to model ragas on a computer, we must be able to include a model of gamakas. Gamakas

More information

Figure 1: Feature Vector Sequence Generator block diagram.

Figure 1: Feature Vector Sequence Generator block diagram. 1 Introduction Figure 1: Feature Vector Sequence Generator block diagram. We propose designing a simple isolated word speech recognition system in Verilog. Our design is naturally divided into two modules.

More information

DAT335 Music Perception and Cognition Cogswell Polytechnical College Spring Week 6 Class Notes

DAT335 Music Perception and Cognition Cogswell Polytechnical College Spring Week 6 Class Notes DAT335 Music Perception and Cognition Cogswell Polytechnical College Spring 2009 Week 6 Class Notes Pitch Perception Introduction Pitch may be described as that attribute of auditory sensation in terms

More information

AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY

AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY AN ARTISTIC TECHNIQUE FOR AUDIO-TO-VIDEO TRANSLATION ON A MUSIC PERCEPTION STUDY Eugene Mikyung Kim Department of Music Technology, Korea National University of Arts eugene@u.northwestern.edu ABSTRACT

More information

Greeley-Evans School District 6 High School Vocal Music Curriculum Guide Unit: Men s and Women s Choir Year 1 Enduring Concept: Expression of Music

Greeley-Evans School District 6 High School Vocal Music Curriculum Guide Unit: Men s and Women s Choir Year 1 Enduring Concept: Expression of Music Unit: Men s and Women s Choir Year 1 Enduring Concept: Expression of Music To perform music accurately and expressively demonstrating self-evaluation and personal interpretation at the minimal level of

More information

2013 Music Style and Composition GA 3: Aural and written examination

2013 Music Style and Composition GA 3: Aural and written examination Music Style and Composition GA 3: Aural and written examination GENERAL COMMENTS The Music Style and Composition examination consisted of two sections worth a total of 100 marks. Both sections were compulsory.

More information

Robert Alexandru Dobre, Cristian Negrescu

Robert Alexandru Dobre, Cristian Negrescu ECAI 2016 - International Conference 8th Edition Electronics, Computers and Artificial Intelligence 30 June -02 July, 2016, Ploiesti, ROMÂNIA Automatic Music Transcription Software Based on Constant Q

More information

CTP431- Music and Audio Computing Music Information Retrieval. Graduate School of Culture Technology KAIST Juhan Nam

CTP431- Music and Audio Computing Music Information Retrieval. Graduate School of Culture Technology KAIST Juhan Nam CTP431- Music and Audio Computing Music Information Retrieval Graduate School of Culture Technology KAIST Juhan Nam 1 Introduction ü Instrument: Piano ü Genre: Classical ü Composer: Chopin ü Key: E-minor

More information

VISUAL CONTENT BASED SEGMENTATION OF TALK & GAME SHOWS. O. Javed, S. Khan, Z. Rasheed, M.Shah. {ojaved, khan, zrasheed,

VISUAL CONTENT BASED SEGMENTATION OF TALK & GAME SHOWS. O. Javed, S. Khan, Z. Rasheed, M.Shah. {ojaved, khan, zrasheed, VISUAL CONTENT BASED SEGMENTATION OF TALK & GAME SHOWS O. Javed, S. Khan, Z. Rasheed, M.Shah {ojaved, khan, zrasheed, shah}@cs.ucf.edu Computer Vision Lab School of Electrical Engineering and Computer

More information

... A Pseudo-Statistical Approach to Commercial Boundary Detection. Prasanna V Rangarajan Dept of Electrical Engineering Columbia University

... A Pseudo-Statistical Approach to Commercial Boundary Detection. Prasanna V Rangarajan Dept of Electrical Engineering Columbia University A Pseudo-Statistical Approach to Commercial Boundary Detection........ Prasanna V Rangarajan Dept of Electrical Engineering Columbia University pvr2001@columbia.edu 1. Introduction Searching and browsing

More information

Book: Fundamentals of Music Processing. Audio Features. Book: Fundamentals of Music Processing. Book: Fundamentals of Music Processing

Book: Fundamentals of Music Processing. Audio Features. Book: Fundamentals of Music Processing. Book: Fundamentals of Music Processing Book: Fundamentals of Music Processing Lecture Music Processing Audio Features Meinard Müller International Audio Laboratories Erlangen meinard.mueller@audiolabs-erlangen.de Meinard Müller Fundamentals

More information

A Framework for Segmentation of Interview Videos

A Framework for Segmentation of Interview Videos A Framework for Segmentation of Interview Videos Omar Javed, Sohaib Khan, Zeeshan Rasheed, Mubarak Shah Computer Vision Lab School of Electrical Engineering and Computer Science University of Central Florida

More information

K-12 Performing Arts - Music Standards Lincoln Community School Sources: ArtsEdge - National Standards for Arts Education

K-12 Performing Arts - Music Standards Lincoln Community School Sources: ArtsEdge - National Standards for Arts Education K-12 Performing Arts - Music Standards Lincoln Community School Sources: ArtsEdge - National Standards for Arts Education Grades K-4 Students sing independently, on pitch and in rhythm, with appropriate

More information

Melody Extraction from Generic Audio Clips Thaminda Edirisooriya, Hansohl Kim, Connie Zeng

Melody Extraction from Generic Audio Clips Thaminda Edirisooriya, Hansohl Kim, Connie Zeng Melody Extraction from Generic Audio Clips Thaminda Edirisooriya, Hansohl Kim, Connie Zeng Introduction In this project we were interested in extracting the melody from generic audio files. Due to the

More information

Methods for the automatic structural analysis of music. Jordan B. L. Smith CIRMMT Workshop on Structural Analysis of Music 26 March 2010

Methods for the automatic structural analysis of music. Jordan B. L. Smith CIRMMT Workshop on Structural Analysis of Music 26 March 2010 1 Methods for the automatic structural analysis of music Jordan B. L. Smith CIRMMT Workshop on Structural Analysis of Music 26 March 2010 2 The problem Going from sound to structure 2 The problem Going

More information

A System for Automatic Chord Transcription from Audio Using Genre-Specific Hidden Markov Models

A System for Automatic Chord Transcription from Audio Using Genre-Specific Hidden Markov Models A System for Automatic Chord Transcription from Audio Using Genre-Specific Hidden Markov Models Kyogu Lee Center for Computer Research in Music and Acoustics Stanford University, Stanford CA 94305, USA

More information

LEARNING SPECTRAL FILTERS FOR SINGLE- AND MULTI-LABEL CLASSIFICATION OF MUSICAL INSTRUMENTS. Patrick Joseph Donnelly

LEARNING SPECTRAL FILTERS FOR SINGLE- AND MULTI-LABEL CLASSIFICATION OF MUSICAL INSTRUMENTS. Patrick Joseph Donnelly LEARNING SPECTRAL FILTERS FOR SINGLE- AND MULTI-LABEL CLASSIFICATION OF MUSICAL INSTRUMENTS by Patrick Joseph Donnelly A dissertation submitted in partial fulfillment of the requirements for the degree

More information

Piano Teacher Program

Piano Teacher Program Piano Teacher Program Associate Teacher Diploma - B.C.M.A. The Associate Teacher Diploma is open to candidates who have attained the age of 17 by the date of their final part of their B.C.M.A. examination.

More information

Audio Feature Extraction for Corpus Analysis

Audio Feature Extraction for Corpus Analysis Audio Feature Extraction for Corpus Analysis Anja Volk Sound and Music Technology 5 Dec 2017 1 Corpus analysis What is corpus analysis study a large corpus of music for gaining insights on general trends

More information

Topics in Computer Music Instrument Identification. Ioanna Karydi

Topics in Computer Music Instrument Identification. Ioanna Karydi Topics in Computer Music Instrument Identification Ioanna Karydi Presentation overview What is instrument identification? Sound attributes & Timbre Human performance The ideal algorithm Selected approaches

More information

J536 Composition. Composing to a set brief Own choice composition

J536 Composition. Composing to a set brief Own choice composition J536 Composition Composing to a set brief Own choice composition Composition starting point 1 AABA melody writing (to a template) Use the seven note Creative Task note patterns as a starting point teaches

More information

SIBELIUS ACADEMY, UNIARTS. BACHELOR OF GLOBAL MUSIC 180 cr

SIBELIUS ACADEMY, UNIARTS. BACHELOR OF GLOBAL MUSIC 180 cr SIBELIUS ACADEMY, UNIARTS BACHELOR OF GLOBAL MUSIC 180 cr Curriculum The Bachelor of Global Music programme embraces cultural diversity and aims to train multi-skilled, innovative musicians and educators

More information

Acoustic Scene Classification

Acoustic Scene Classification Acoustic Scene Classification Marc-Christoph Gerasch Seminar Topics in Computer Music - Acoustic Scene Classification 6/24/2015 1 Outline Acoustic Scene Classification - definition History and state of

More information

Woodlynne School District Curriculum Guide. General Music Grades 3-4

Woodlynne School District Curriculum Guide. General Music Grades 3-4 Woodlynne School District Curriculum Guide General Music Grades 3-4 1 Woodlynne School District Curriculum Guide Content Area: Performing Arts Course Title: General Music Grade Level: 3-4 Unit 1: Duration

More information

FINE ARTS Institutional (ILO), Program (PLO), and Course (SLO) Alignment

FINE ARTS Institutional (ILO), Program (PLO), and Course (SLO) Alignment FINE ARTS Institutional (ILO), Program (PLO), and Course (SLO) Program: Music Number of Courses: 52 Date Updated: 11.19.2014 Submitted by: V. Palacios, ext. 3535 ILOs 1. Critical Thinking Students apply

More information

Foundation - MINIMUM EXPECTED STANDARDS By the end of the Foundation Year most pupils should be able to:

Foundation - MINIMUM EXPECTED STANDARDS By the end of the Foundation Year most pupils should be able to: Foundation - MINIMUM EXPECTED STANDARDS By the end of the Foundation Year most pupils should be able to: PERFORM (Singing / Playing) Active learning Speak and chant short phases together Find their singing

More information

Music Source Separation

Music Source Separation Music Source Separation Hao-Wei Tseng Electrical and Engineering System University of Michigan Ann Arbor, Michigan Email: blakesen@umich.edu Abstract In popular music, a cover version or cover song, or

More information

Characteristics of Polyphonic Music Style and Markov Model of Pitch-Class Intervals

Characteristics of Polyphonic Music Style and Markov Model of Pitch-Class Intervals Characteristics of Polyphonic Music Style and Markov Model of Pitch-Class Intervals Eita Nakamura and Shinji Takaki National Institute of Informatics, Tokyo 101-8430, Japan eita.nakamura@gmail.com, takaki@nii.ac.jp

More information

MODELS of music begin with a representation of the

MODELS of music begin with a representation of the 602 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 18, NO. 3, MARCH 2010 Modeling Music as a Dynamic Texture Luke Barrington, Student Member, IEEE, Antoni B. Chan, Member, IEEE, and

More information

Outline. Why do we classify? Audio Classification

Outline. Why do we classify? Audio Classification Outline Introduction Music Information Retrieval Classification Process Steps Pitch Histograms Multiple Pitch Detection Algorithm Musical Genre Classification Implementation Future Work Why do we classify

More information

LESSON 1 PITCH NOTATION AND INTERVALS

LESSON 1 PITCH NOTATION AND INTERVALS FUNDAMENTALS I 1 Fundamentals I UNIT-I LESSON 1 PITCH NOTATION AND INTERVALS Sounds that we perceive as being musical have four basic elements; pitch, loudness, timbre, and duration. Pitch is the relative

More information

Drum Sound Identification for Polyphonic Music Using Template Adaptation and Matching Methods

Drum Sound Identification for Polyphonic Music Using Template Adaptation and Matching Methods Drum Sound Identification for Polyphonic Music Using Template Adaptation and Matching Methods Kazuyoshi Yoshii, Masataka Goto and Hiroshi G. Okuno Department of Intelligence Science and Technology National

More information