
Recommending Music for Language Learning: The Problem of Singing Voice Intelligibility

Karim M. Ibrahim
(M.Sc., Nile University, Cairo, 2016)

A THESIS SUBMITTED FOR THE DEGREE OF MASTER OF SCIENCE
DEPARTMENT OF COMPUTER SCIENCE
NATIONAL UNIVERSITY OF SINGAPORE
2018

Supervisor: Associate Professor Ye Wang
Examiners: Associate Professor Ng Teck Khim, Associate Professor Huang Zhiyong

DECLARATION

I hereby declare that this thesis is my original work and it has been written by me in its entirety. I have duly acknowledged all the sources of information which have been used in the thesis. This thesis has also not been submitted for any degree in any university previously.

Signature:
Date: 16 October 2018
Karim M. Ibrahim

ACKNOWLEDGMENTS

Firstly, I would like to express my sincere gratitude to my advisor Prof. Wang Ye for the continuous support of my study and related research, and for his patience and motivation. His guidance and helpful advice have shaped both my research and my life. My sincere thanks also go to Dr. Kat Agres, Dr. David Grunberg, and Dr. Douglas Turnball; without their precious support it would not have been possible to conduct this research. I thank my fellow labmates for the stimulating discussions, for the sleepless nights we were working together before deadlines, and for all the fun we have had in the last two years. Last but not least, I would like to thank my family for supporting me spiritually throughout the writing of this thesis and my life.

Contents

Abstract
List of Publications
List of Tables
List of Figures
List of Abbreviations

1 Introduction
  1.1 Motivation
  1.2 Problem Statement
  1.3 Contributions
  1.4 Thesis Outline

2 Literature Survey and Problem Identification
  2.1 Factors affecting sung lyrics intelligibility
  2.2 Relevance to Speech Intelligibility
  2.3 Problem Statement

3 Proposed System
  3.1 Defining ground truth measure of intelligibility
  3.2 The behavioral experiment and collecting the dataset
    3.2.1 Lab Experiment
    3.2.2 Amazon Mechanical Turk Experiment
  3.3 Investigating relevant acoustic features
    3.3.1 Preprocessing
    3.3.2 Audio features
  3.4 Building an acoustic model

4 Future Work

5 Conclusions

References

Appendices
A Transcription Surveys
B List of songs

ABSTRACT

Learning a new language is a complex task that requires time, dedication, and continuous learning. Language immersion is a recommended approach to maximize learning by using the foreign language in daily activities. Research has shown that listening to music in a foreign language can enhance the listener's language skills and enrich their vocabulary. In this study, we investigate how to recommend songs that maximize the language learner's benefit from listening to songs in the target language. Specifically, we propose a method for annotating songs according to their intelligibility to human listeners. We then propose a number of acoustic features that measure the different factors affecting intelligibility in the singing voice and use them to build a system that automatically estimates the intelligibility of a given song. Finally, we study the usability of crowdsourcing platforms for collecting and annotating songs according to their intelligibility score on a large scale.

List of Publications

Ibrahim, K. M., Grunberg, D., Agres, K., Gupta, C., and Wang, Y. Intelligibility of sung lyrics: A pilot study. In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), October 2017.

List of Tables

3.1 Comparison between lab-controlled and MTurk experiments in terms of cost, preparation time, and time to get results. The results show that MTurk is superior in all three categories.
3.2 Classification accuracy for different genres.
B.1 List of songs used in the first lab experiment and model training. The dataset included five genres: Pop/Rock, Jazz, RnB, Folk, and Classical, with 10 songs per genre.
B.2 List of songs used in the second MTurk experiment. The dataset focused on less intelligible genres and included five genres: Metal, Rap, Electro, Reggae, and Punk, with 10 songs per genre.

List of Figures

3.1 The process of labeling songs with intelligibility score.
3.2 The distribution of the transcription accuracies (intelligibility score).
3.3 The webpage setup for the MTurk experiment.
3.4 Scores obtained from the lab-controlled experiment vs. the MTurk experiment.
3.5 Comparison between intelligibility score distributions across batches.
3.6 Intelligibility scores across different genres.
3.7 Confusion matrix of the SVM output.
3.8 Confusion matrices of the different genres.
A.1 Survey filled in by the participants in the lab experiment.
A.2 Survey filled in by the participants in the lab experiment.

List of Abbreviations

VAR    Vocal to Accompaniment Ratio
HRR    Harmonics to Residual Ratio
HFE    High Frequency Energy
HFC    High Frequency Component
MTurk  Amazon Mechanical Turk
MFCC   Mel-frequency cepstral coefficients

Chapter 1
Introduction

It is common practice for many individuals to listen to music on a daily basis. It has been shown that music affects mood and mental clarity [28]. Based on this, music has been used in various contexts to solve different problems. For example, music is used in health-related applications, e.g. music therapy for patients diagnosed with Parkinson's disease or terminal cancer [11, 38], in improving the education process [10], and in improving the mood of customers while shopping [49]. In this work, we focus on using music to improve the process of learning a new language. We study the factors that would make a listener favor one song over another in terms of its suitability for understanding the sung lyrics. We then propose a computational model to automatically estimate these factors.

1.1 Motivation

Learning a foreign language is a complex task that receives much attention from the research community on how to facilitate the process. Language immersion is a common strategy to improve a foreign language by performing daily tasks in the target language, e.g. reading articles and watching movies. For a daily music listener who is learning a new language, selecting suitable songs in the target language can help enrich the student's vocabulary and familiarity with the language. Research has shown that singing and language development are closely related at the neurological level [35, 45], and experimental results have demonstrated that singing along with music in the second language is an effective way of improving memorization and pronunciation [21, 31]. However, specific songs are only likely to help these students if they can understand the content of the lyrics [19]. As second language learners may have difficulty in understanding certain songs in their second language due to their lack of fluency, they could be helped by a system capable of automatically determining which songs they are likely to find intelligible and that match their level of fluency. Although singing voice analysis has received much attention from the research community, the problem of the intelligibility of a given set of sung lyrics has not been studied as much.

1.2 Problem Statement

Intelligibility describes how easily a listener can comprehend the words that a performer sings. The lyrics of very intelligible songs can easily be understood, while the lyrics of less intelligible songs sound incomprehensible to the average listener. People's impressions of many songs are strongly influenced by how intelligible the lyrics are, with one study even finding that certain songs were perceived as happy when listeners could not understand the lyrics, but as sad when the lyrics were made comprehensible [32]. It would thus be useful to enable systems to automatically determine intelligibility, as it is a key factor in people's perception of a wide variety of songs. Besides the intelligibility of the singing voice, other factors that affect the progress of language learning with music are the complexity and sentence structure of the lyrics. For beginner levels, using correct grammar and simple language is recommended to enrich the learner's vocabulary. However, at an advanced level, listening to songs that contain more colloquial and less formal language is useful for reaching a higher level of familiarity and cultural integration. Music is regarded as a gateway to understanding a society's culture, connecting with the current generation, and understanding more about the cultural history of the countries speaking the foreign language. It is common for music to express and discuss society's issues and conditions. This is an important part of learning a new language. People who learn a new language are often interested in integrating into these foreign societies and connecting with people of the same age, a challenge that traditional classrooms often fail to resolve. In this study, we focus mainly on estimating the intelligibility of a given song. We define the system structure as follows:

Inputs:
- A target song (singing voice mixed with background music).
- The corresponding lyric text (if available).

Output:
- A score between 0 and 1 reflecting the intelligibility of the sung lyrics.

1.3 Contributions

The focus of this thesis is investigating the problem of intelligibility as an essential first step in recommending music for language learning. After reviewing the factors that affect the intelligibility of the singing voice based on cognitive studies, we proceed to build a computational model for intelligibility estimation. Our main contributions can be summarized as follows:

1. We propose a reliable behavioral study to label songs according to their intelligibility.
2. We propose a set of acoustic and textual features that reflect the different factors affecting a song's intelligibility.
3. We train a prediction model using the proposed features to automatically estimate the intelligibility of a given song.
4. We study the efficiency and accuracy of using crowdsourcing for intelligibility score annotation compared to lab-controlled annotation.

We conclude by proposing directions for future work that include other factors affecting language learning with music, such as lyrics complexity and grammatical correctness, and their applications in building a music recommendation system for language learning.

1.4 Thesis Outline

The thesis is structured as follows: Chapter 2 surveys the existing literature and defines the problem to be investigated. In Chapter 3 we report our proposed system, which includes our labeling scheme, features, model training, and the crowdsourcing approach for data annotation. Future directions and plans are described in Chapter 4. Finally, we conclude the thesis with a summary in Chapter 5.

Chapter 2
Literature Survey and Problem Identification

The problem of recommending music for language learners has not been extensively studied in the literature. However, there have been some cognitive studies on intelligibility and lyrics complexity which are relevant to this problem. In the following we cover some of the cognitive studies which we will use as a basis for building our acoustic model. One of the primary factors in selecting suitable songs for language learners is the intelligibility of the sung lyrics. The fact that sung lyrics can be more difficult to comprehend than spoken words has long been established in the scientific community. For example, singing at high pitch significantly impairs the intelligibility of the sung lyrics, and one study showed that even professional voice teachers and phoneticians had difficulty telling vowels apart when sung at high pitch [12]. Another study, by Collister and Huron, showed that sung lyrics cause hearing errors as much as seven times more frequently than spoken words [3]. Such studies also noted lyric features which could help differentiate intelligible from unintelligible songs; for instance, one study noted that songs comprised mostly of common words sounded more intelligible than songs with less frequent words [17]. However, lyric features alone are not sufficient to assess intelligibility; the same lyrics can be rendered more or less intelligible depending on, for instance, the speed at which they are sung. These other factors must be taken into account to truly assess lyric intelligibility.

One feature of the singing voice that has been addressed by the research community is the overall quality of the singing voice. Some of the methods of assessing singing voice quality proposed in the literature have been shown to reliably distinguish between trained and untrained singers [2, 34, 47]. One acoustic feature which multiple studies have found to be useful for this purpose is the power ratio of frequency bands containing energy from the singing voice to other frequency bands. Additionally, calculation of pitch intervals and vibrato has also been shown to be useful for this purpose [33]. However, while the quality of the singing voice may be a factor in assessing intelligibility, it is not the only such factor. Aspects of the song that have nothing to do with the skill of the singer or the quality of their performance, such as the presence of loud background instruments, can contribute, and additional features that take these factors into account are needed for a system which determines lyric intelligibility.

Another related task is that of singing transcription, in which a computer must listen to and transcribe sung lyrics [29]. It may seem that one could assess intelligibility by comparing a computer's transcription of the lyrics to a ground truth set of lyrics and determining if the transcription is accurate. But this too does not really determine intelligibility, at least as humans perceive it. A computer can use various filters and other signal processing or machine learning tools to process the audio and make it easier to understand, but a human listening to the music will not necessarily have access to such tools. Thus, even if a computer can understand or accurately transcribe the lyrics of a piece of music, this does not indicate whether those lyrics would be intelligible to a human as well.

2.1 Factors affecting sung lyrics intelligibility

Several cognitive studies have been conducted to identify the factors affecting the intelligibility of sung lyrics. In the following we go through these studies and list their findings on which factors are important for the problem at hand. These factors will be the basis on which we build our acoustic model to automatically assess intelligibility similarly to how listeners perceive it. One recent study investigated several factors that are assumed to affect intelligibility, which are detailed below, and conducted a behavioral experiment to test these hypotheses [17]. Another study investigated the factors affecting intelligibility categorically, as performer-related, environment-related, or listener-related factors [8]. Finally, in [5], the authors studied intelligibility with a focus on style- and genre-related factors and how different genres are more or less intelligible than others on average. Since the purpose of this work is to build an acoustic system to assess intelligibility, we excluded the listener-related factors, such as hearing impairment. In the following, we list the different factors that were shown to affect intelligibility, which will be the basis for building our model:

1. Balance between singer(s) and accompaniment [8].
2. Using common and familiar words in the language increases the intelligibility of the song [17].
3. Melismas, syllables that are sustained over several notes, reduce the intelligibility of a song [17].
4. When syllable stresses are aligned with musical stresses, the intelligibility increases [17].
5. Repeating the same words across multiple phrases in the same song increases intelligibility [17].
6. Genre and compositional style [5, 8].

Based on the above factors defined by cognitive studies, it is possible to measure these individual factors from an audio signal and build a model capable of estimating a song's suitability for a language learner. However, building such a system requires a labeled dataset with ground truth reflecting the intelligibility and complexity levels.

2.2 Relevance to Speech Intelligibility

Speech and singing are naturally associated and share several similar problems. Hence, it is important to consider the speech intelligibility problem and whether the approaches proposed in the literature are relevant to singing intelligibility. It is clear that the factors that make speech unintelligible would also apply in the case of singing. However, by reviewing the literature, we observe that most approaches focus on estimating speech intelligibility after the speech has been distorted (modified) by a transmission system. The purpose of these studies is to evaluate the system by measuring the quality and intelligibility of the speech after transmission, and these methods require access to the original speech before transmission. Studying the literature, we find that there are mainly two approaches to quantifying intelligibility, one subjective and one objective [44]. Subjective measures estimate intelligibility using a listening test [30]. Objective measures focus on how to computationally estimate the intelligibility of a given signal. Since many of these approaches were developed to measure the transmission quality of transmission systems, most depend on having the original signal as a reference. An example of an objective measure is the Articulation Index (AI) [20]. The calculation of the articulation index is based on measuring the signal-to-noise ratio in the transmitted signal over a number of specific frequency bands. Additional approaches were proposed that build on top of the articulation index, as in [13, 43]. These approaches are the basis of a similar, more recent measure called the Speech Transmission Index (STI) [14], which is standardized by IEC standard 60268-16 [4]. However, the articulation index and speech transmission index focus on the intelligibility of a transmission system and are based mainly on errors due to transmission quality, rather than on speech-related factors. Additionally, they require a clean version of the speech for comparison, which is not available in the case of singing.
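To make the idea behind such band-based measures concrete, the following is a minimal, illustrative sketch of an articulation-index-style computation: per-band signal-to-noise ratios are clipped to a usable range, rescaled to 0..1, and averaged with band-importance weights. The uniform weights and the example band powers here are hypothetical; they are not the standardized values of [20] or IEC 60268-16.

```python
import numpy as np

def band_snr_index(signal_power, noise_power, weights=None):
    """Illustrative articulation-index-style score from per-band powers.

    signal_power, noise_power: arrays of per-band power estimates.
    weights: hypothetical band-importance weights (uniform if omitted).
    """
    snr_db = 10.0 * np.log10(np.asarray(signal_power) / np.asarray(noise_power))
    # Map each band's SNR from -15..+15 dB onto 0..1, clipping values outside.
    contrib = np.clip((snr_db + 15.0) / 30.0, 0.0, 1.0)
    if weights is None:
        weights = np.full(len(contrib), 1.0 / len(contrib))
    return float(np.sum(weights * contrib))

# Example with hypothetical per-band powers for five bands:
print(band_snr_index([4.0, 3.0, 2.0, 1.0, 0.5], [1.0, 1.0, 1.0, 1.0, 1.0]))
```

Note how the measure needs both the clean signal power and the noise power per band, which is exactly the reference information that is unavailable for singing.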

While adapting these measures to the case of singing is still an open area of research, in this thesis we focus on studying and measuring intelligibility based on the factors that are specific to the singing voice.

2.3 Problem Statement

Our goal is to design a system to recommend songs for students learning a foreign language as part of their language immersion. The main research problem to be solved is to automatically estimate the intelligibility of the sung lyrics of a given song. Solving this problem will help in selecting suitable songs with comprehensible sung lyrics matching the user's level of fluency. To solve this problem, the challenges are:

1. Defining a ground truth measure of intelligibility.
2. Collecting a dataset for this specific problem.
3. Investigating relevant acoustic features.
4. Building a predictive model to estimate the intelligibility using the selected features.

Solving the problem of intelligibility estimation is sufficient to integrate this criterion into current recommendation systems, so that they recommend songs that match the user's taste and also score higher than a certain threshold depending on the user's fluency. However, this would only serve as an initial system to which additional factors can be added afterwards.

Chapter 3
Proposed System

In this chapter, we discuss the steps taken in solving the target problem. We specifically focus on estimating intelligibility as an initial step in selecting songs suitable for language learners. We state our approach to solving each of the four main challenges introduced in the problem statement in the previous chapter.

3.1 Defining ground truth measure of intelligibility

To build a system that can automatically process a song and evaluate the intelligibility of its lyrics, it is essential to gather ground truth data that reflects this intelligibility on average across different listeners. Hence, we conducted a study where participants were tasked with listening to short excerpts of music and transcribing the lyrics, a common task for evaluating the intelligibility of lyrics [5]. The accuracy of their transcriptions can be used to assess the intelligibility of each excerpt. The experiment was initially conducted in a lab, which required the physical presence of the participants. In the next phase, we investigated the possibility of labeling our dataset using crowdsourcing platforms, specifically Amazon Mechanical Turk (MTurk), and verified its suitability and accuracy for this specific task, as described in Section 3.2.2.

3.2 The behavioral experiment and collecting the dataset

In order to study the different acoustic and textual features useful in estimating intelligibility and to build an acoustic model to predict it, it is essential to have reliable and well-labeled data. Using the method defined in Section 3.1, we collected a total of 200 excerpts across two phases. The first phase used a setup that required the physical presence of the participants in the lab and was used to label 100 excerpts. The second verified the accuracy of using the MTurk platform for this labeling task and was then used to label another 100 excerpts.

3.2.1 Lab Experiment

Participants

Seventeen participants (seven females and ten males) volunteered to take part in the experiment. Participants were between 21 and 41 years old (mean = 27.4 years). All participants indicated no history of hearing impairment and that they spoke some English as a second language. Participants were rewarded with a $10 voucher for their time. Participants were recruited through university channels via posters and fliers. The majority of the participants were university students.

Materials

For the purpose of this study, we focused solely on English-language songs. Because one of the main applications for such a system is to recommend music for students who are learning foreign languages, we focused on genres that are popular among students. To identify these genres, we asked 48 university students to choose the 3 genres that they listen to the most, out of the 12 genres introduced in [5], as these 12 genres cover a wide variety of singing styles. The twelve genres are: Avant-garde, Blues, Classical, Country, Folk, Jazz, Pop/Rock, Rhythm and Blues, Rap, Reggae, Religious, and Theater. Because the transcription task is long and tiring for participants, we limited the number of genres tested to five, from which we would draw approximately 45 minutes' worth of music for transcription. We selected the five most popular genres indicated by the 48 participants: Classical, Folk, Jazz, Pop/Rock, and Rhythm and Blues. After selecting the genres, we collected a dataset of 10 songs per genre. Because we were interested in evaluating participants' ability to transcribe an unfamiliar song, as opposed to transcribing a known song from memory, we focused on selecting songs that are not well known in each genre. We approached this by selecting songs that have fewer than 200 ratings on the website Rate Your Music (rateyourmusic.com). Rate Your Music is a database of popular music where users can rate and review different songs, albums, and artists. Popular songs have thousands of ratings while lesser-known songs have few ratings. We used this criterion to collect songs spanning the 5 genres to produce our dataset. The songs were randomly selected, with no control over the vocal range or the singer's accent, as long as they satisfied the condition of being in English and having few ratings. Because transcribing an entire song, let alone 50 songs, would be an overwhelming process for the participants, we selected short excerpts from each song to be transcribed. Two excerpts per song were selected randomly such that each excerpt would include a complete utterance (e.g., no excerpts were terminated mid-phrase). Excerpts varied between 3 and 16 seconds in length (average = 6.5 seconds), and contained 9.5 words on average. The ground-truth lyrics for these songs were collected from online sources and reviewed by the experimenters to ensure they matched the version of the song used in the experiment. It is important to note that selecting short excerpts might affect intelligibility, because the context of the song (which may help in understanding the lyrics) is lost. However, using these short excerpts is essential in making the experiment feasible for the participants, and they should still broadly reflect the intelligibility of the song. The complete dataset is composed of 100 excerpts from 50 songs, 2 excerpts per song, covering 5 genres, with 10 songs per genre.

Figure 3.1: The process of labeling songs with an intelligibility score (a song with ground-truth lyrics; human participants transcribe the lyrics; the transcriptions are compared with the ground truth to yield an intelligibility score).

Procedure

We conducted the experiment in three group listening sessions. During each session, the participants were seated in a computer lab and recorded their transcriptions of the played excerpts on the computer in front of them. The excerpts were played in randomized order, and each excerpt was played twice consecutively. Between the two playbacks of each excerpt there was a pause of 5 seconds, and between different excerpts a pause of 10 seconds, to allow the participants sufficient time to write their transcriptions. The total duration of the listening session was 46:59 minutes. Two practice trials were presented before the experimental trials began, to familiarize participants with the experimental procedure. Figure 3.1 shows the complete procedure for labeling one given song.

Results and Discussion

To evaluate the accuracy of the participants' transcriptions, we counted the number of words correctly transcribed by the participant that match the ground-truth lyrics. For each transcription by each student, the ratio between correctly transcribed words and the total number of words in the excerpt was calculated. We then calculated the average ratio for each excerpt across all 17 participants to yield an overall score for each excerpt between 0 and 1. This score was used to represent the ground-truth transcription accuracy, or intelligibility score, for each excerpt. The distribution of intelligibility scores in the dataset is shown in Figure 3.2. From the figure, we can observe that the intelligibility scores are biased towards higher values, i.e. there are relatively few excerpts with a low intelligibility score. This is caused by the restricted set of popular genres indicated by students, as certain excluded genres, such as Heavy Metal, would be expected to have low intelligibility.
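As an illustration of this scoring scheme, the following is a minimal sketch of how a per-excerpt intelligibility score could be computed from the collected transcriptions. The function names and the word-matching rule (multiset overlap of lowercased tokens, ignoring punctuation) are our own simplifications, since the exact matching procedure is not formalized here.

```python
import re
from collections import Counter

def transcription_accuracy(transcript, ground_truth):
    """Fraction of ground-truth words matched by one participant's transcript."""
    tokenize = lambda s: re.findall(r"[a-z']+", s.lower())
    truth, trans = tokenize(ground_truth), tokenize(transcript)
    matched = sum((Counter(trans) & Counter(truth)).values())  # multiset overlap
    return matched / len(truth) if truth else 0.0

def intelligibility_score(transcripts, ground_truth):
    """Average transcription accuracy across all participants (0..1)."""
    return sum(transcription_accuracy(t, ground_truth) for t in transcripts) / len(transcripts)

# Example: two participants transcribing the same excerpt.
lyrics = "walking down the lonely road again"
print(intelligibility_score(["walking down the road",
                             "walking down a lonely road"], lyrics))
```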

Figure 3.2: The distribution of the transcription accuracies (intelligibility score) for the first batch of the dataset.

Not having a wide variance of intelligibility scores will affect our system's ability to learn. Hence, in the second phase of collecting the dataset, we focused on including the less intelligible genres that were not included in this phase.

3.2.2 Amazon Mechanical Turk Experiment

One major drawback of the previous method is that it requires a long time and the physical presence of both participants and researchers, and needs a space reserved for the whole duration of the experiment. This makes performing the experiment on a large scale difficult. Hence, we investigated the possibility of using an online crowdsourcing platform such as MTurk. MTurk is an online platform that enables individuals and employers to coordinate and perform human intelligence tasks, known as HITs, in return for a payment. This resolves the problems of requiring the physical presence of both parties and the need for a reserved space for the experiment. However, we needed to investigate whether it would save time and whether it would produce results with high accuracy that correlate with the results from the controlled lab. Other studies have been conducted to validate MTurk for speech transcription [26]; however, to our knowledge, there are no such studies for lyrics transcription and its use in estimating an intelligibility score.

MTurk setup

The webpage interface is the main setup for the MTurk experiment. We designed it in a way that delivers all the required information to the user, while replicating the experiment conducted in the lab, for comparison and validation purposes. Figure 3.3 shows the webpage interface for the participants. The setup is composed of three main parts: instructions, playback and transcription, and a short survey of the participant's age, gender, musical experience, and favorite genres. We priced a single HIT at 0.01 US dollars. We limited the time of the HIT to a maximum of 2 minutes and allowed the excerpt to be played a maximum of two times, the same as in the lab experiment.

Figure 3.3: The webpage setup for the MTurk experiment.

Results and Discussion

The initial part of the experiment was conducted using the same 100 excerpts used in the previous lab experiment. We asked for 17 transcriptions per excerpt, the same as in the lab experiment. For a total number of 1700 HITs, the whole experiment was completed in one week.

Table 3.1 shows a comparison between the lab and MTurk experiments. Figure 3.4 shows the scores obtained from both the lab and MTurk experiments. The figure shows that the results correlate with each other, with a Pearson's correlation of 0.79. Hence, it is safe to assume that MTurk is a reliable platform for labeling such datasets with our required criteria, while being more economical and requiring less time.

Figure 3.4: Scores obtained from the lab-controlled experiment vs. the MTurk experiment.

| Method | Cost | Preparation time | Time to get results |
|--------|------|------------------|---------------------|
| Lab | $170 | 2 weeks to find participants | 3 one-hour sessions across two weeks |
| MTurk | $41 | 1 week to get familiar with MTurk and set up the webpage (needed only once) | 1 week |

Table 3.1: Comparison between the lab-controlled and MTurk experiments in terms of cost, preparation time, and time to get results. The results show that MTurk is superior in all three categories.
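Given the two score lists, this kind of agreement check is straightforward to reproduce, as in the sketch below; the arrays are hypothetical stand-ins for the 100 per-excerpt scores from each setting.

```python
from scipy.stats import pearsonr

# Hypothetical per-excerpt intelligibility scores from the two settings.
lab_scores = [0.91, 0.45, 0.78, 0.30, 0.66]
mturk_scores = [0.88, 0.52, 0.70, 0.35, 0.71]

r, p_value = pearsonr(lab_scores, mturk_scores)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```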

Extending the dataset using MTurk

After validating the reliability of using MTurk for labeling songs with an intelligibility score, we extended our dataset with an additional 100 excerpts. The motivation was to enlarge the dataset and to balance it between highly and less intelligible songs. As previously shown in Figure 3.2, the first batch of the dataset was skewed towards highly intelligible songs, due to the use of certain intelligible genres. In the second batch of data collection, we focused on genres that are known to be less intelligible, to balance the distribution of intelligibility scores across the dataset. For the second batch, we collected another 100 excerpts from these genres: Metal, Punk, Rap, Reggae, and Electro. As shown in Figure 3.5, the second batch included more excerpts with low intelligibility, which balanced the skewed scores of the first batch. Having an evenly distributed dataset is important for model training and generalization. Figure 3.6 shows the distribution of intelligibility scores across the different genres. We can see how the five extra genres have lower intelligibility scores on average, balancing the dataset. The results also broadly agree with the results from [5], which was a lab-controlled experiment, additionally validating the scores collected from MTurk.

Figure 3.5: Comparison between the intelligibility score distributions across batches: (a) the first batch, (b) the second batch, and (c) the full dataset.

Figure 3.6: Intelligibility scores across different genres.

3.3 Investigating relevant acoustic features

The purpose of this study is to select audio features that can be used to build a system capable of 1) predicting the intelligibility of song lyrics, and 2) evaluating the accuracy of these predictions with respect to the ground truth gathered from human participants. In the following approach, we analyze the input signal and extract expressive features that reflect the different aspects of an intelligible singing voice. Several properties may contribute to making the singing voice less intelligible than normal speech. One such aspect is the presence of background music, as accompanying music can cover or obscure the voice. Therefore, highly intelligible songs would be expected to have a dominant singing voice compared with the accompanying music [5]. Unlike speech, the singing voice has a wider and more dynamic pitch range, often featuring higher pitches in the soprano vocal range. This has been shown to affect the intelligibility of songs, especially with respect to the perception of sung vowels [3, 1]. An additional consideration is that in certain genres, such as Rap, singing is faster and has a higher rate of words per minute than speech, which can reduce intelligibility. Furthermore, as indicated in [18], the presence of common, frequently occurring words helps increase intelligibility, while uncommon words decrease the likelihood of understanding the lyrics. In our model, we aimed to include features that express these different aspects to determine the intelligibility of song lyrics across different genres. These features are then used to train the model to accurately predict the intelligibility of lyrics in the dataset, based on the ground truth collected in our behavioral experiment. This part of the study was conducted on the initial 100 excerpts, before the dataset extension using MTurk.

3.3.1 Preprocessing

To extract the proposed features from an input song, two initial steps are required: separating the singing voice from the accompaniment, and detecting the segments with vocals. To address these steps, we selected the following approaches based on current state-of-the-art methods:

Vocals Separation

Separating vocals from accompaniment music is a well-known problem that has received considerable attention in the research community. Our approach makes use of the popular Adaptive REPET algorithm [25]. This algorithm is based on detecting the repeating pattern in the song, which is meant to represent the background music. Subtracting the detected pattern leaves the non-repeating part of the song, meant to capture the vocals. Adaptive REPET also has the advantage over the original REPET algorithm [37] of discovering local repeating patterns in the song. Choosing Adaptive REPET was based on two main advantages: the algorithm is computationally attractive, and it shows competitive results compared to other separation algorithms, as shown in the evaluation of [23].
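As a rough illustration of repetition-based vocal separation, the sketch below uses librosa's similarity-based filtering with soft masking, which is in the spirit of REPET-SIM rather than the Adaptive REPET implementation used here; the margin and width parameters are illustrative choices, not values from [25].

```python
import numpy as np
import librosa

y, sr = librosa.load("song.wav", sr=None)  # hypothetical input file
S_full, phase = librosa.magphase(librosa.stft(y))

# Estimate the repeating background by aggregating each frame with its most
# similar frames (cosine similarity), then cap the estimate by the mixture.
S_bg = librosa.decompose.nn_filter(S_full, aggregate=np.median, metric="cosine",
                                   width=int(librosa.time_to_frames(2, sr=sr)))
S_bg = np.minimum(S_full, S_bg)

# Soft masks: the non-repeating residual is attributed to the vocals.
mask_voc = librosa.util.softmask(S_full - S_bg, 10 * S_bg, power=2)
mask_acc = librosa.util.softmask(S_bg, 2 * (S_full - S_bg), power=2)

vocals = librosa.istft(mask_voc * S_full * phase)
accompaniment = librosa.istft(mask_acc * S_full * phase)
```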

Detecting Vocal Segments

Detecting vocal and non-vocal segments in a song is an important step in extracting additional information about the intelligibility of the lyrics. Various approaches have been proposed to perform accurate vocal segmentation; however, it remains a challenging problem. For our approach, we implemented a method based on extracting the features proposed in [24], then training a Random Forest classifier on the Jamendo corpus [39] (http://www.mathieuramona.com/wp/data/jamendo/). The classifier was then used to classify each frame of the input file as either vocal or non-vocal.
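The following is a minimal sketch of such a frame-level classifier. For brevity it uses per-frame MFCCs as a stand-in for the feature set of [24], and the training arrays are random placeholders for frame-aligned Jamendo annotations.

```python
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def frame_features(y, sr):
    """Per-frame features; MFCCs stand in for the richer set of [24]."""
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T  # (n_frames, 13)

# Hypothetical training data: stacked frame features with labels
# 1 = vocal frame, 0 = non-vocal frame (random placeholders here).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 13))
y_train = rng.integers(0, 2, size=1000)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Binary vocal-activity decision for every frame of a new song.
y_new, sr = librosa.load("song.wav", sr=None)  # hypothetical input file
vocal_frames = clf.predict(frame_features(y_new, sr)).astype(bool)
```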

3.3.2 Audio features

In this section, we describe the set of features we used in training the model for estimating lyrics intelligibility. We use a mix of features reflecting specific aspects of intelligibility plus common standard acoustic features (a sketch of several of these computations is given after the list). The selected features are:

1. Vocals to Accompaniment Music Ratio (VAR): Defined as the energy of the separated vocals divided by the energy of the accompaniment music. This ratio is computed only in segments where vocals are present. This feature reflects how strong the vocals are compared to the accompaniment. A high VAR suggests that the vocals are relatively loud and less likely to be obscured by the music; hence, a higher VAR indicates higher intelligibility. This feature is particularly useful in identifying songs that are unintelligible due to loud background music obscuring the vocals.

2. Harmonics-to-residual Ratio (HRR): Defined as the energy in a detected fundamental frequency (f0), estimated according to the YIN algorithm [6], plus the energy in its first 20 harmonics (a number chosen based on empirical trials), all divided by the energy of the residual. This ratio is also applied only to segments where vocals are present. Since the harmonics of the detected f0 in vocal segments are expected to be produced by the singing voice, this ratio, like VAR, helps to determine whether the vocals in a given piece of music are stronger or weaker than the background music which might obscure them.

3. High Frequency Energy (HFE): Defined as the sum of the spectral magnitudes above 4 kHz,

$\mathrm{HFE}_n = \sum_{k=f_{4k}}^{N_b/2} a_{n,k}$  (3.1)

where $a_{n,k}$ is the magnitude at block $n$ and FFT index $k$ of the short-time Fourier transform of the input signal, $f_{4k}$ is the index corresponding to 4 kHz, and $N_b$ is the FFT size [16]. We calculate the mean across all frames of the separated and segmented vocal signal, as we are interested in the high-frequency energy of the vocals and not of the accompanying instruments. We thus get one scalar value per input file reflecting high frequency energy. Singing at higher frequencies has been shown to be less intelligible than singing at lower frequencies [3], so detection of high frequency energy can be a useful clue that such vocals might be present and could reduce the intelligibility of the music, as frequently happens with opera music.

4. High Frequency Component (HFC): Defined as the sum of the spectral magnitudes weighted by the frequency index squared,

$\mathrm{HFC}_n = \sum_{k=1}^{N_b/2} k^2\, a_{n,k}$  (3.2)

where $a_{n,k}$ is the magnitude at block $n$ and FFT index $k$ of the short-time Fourier transform of the input signal and $N_b$ is the FFT size [27]. This is another measure of high frequency content.

5. Syllable Rate: Singing at a fast pace, pronouncing several syllables over a short period of time, can negatively affect intelligibility [7]. In the past, Rao et al. used the temporal dynamics of timbral features to separate the singing voice from background music [40]. These features showed more variance over time for the singing voice, while being relatively invariant for background instruments. We expect that these features will also be sensitive to the syllable rate in singing. We use the temporal standard deviation of two of their timbral features: the sub-band energy (SE) in the range 300-900 Hz, and the sub-band spectral centroid (SSC) in the range 1.2-4.5 kHz, defined as

$\mathrm{SSC} = \dfrac{\sum_{k=k_{low}}^{k_{high}} f(k)\,|X(k)|}{\sum_{k=k_{low}}^{k_{high}} |X(k)|}$  (3.3)

$\mathrm{SE} = \sum_{k=k_{low}}^{k_{high}} |X(k)|^2$  (3.4)

where $f(k)$ and $X(k)$ are the frequency and the magnitude spectral value of the $k$-th frequency bin, and $k_{low}$ and $k_{high}$ are the nearest frequency bins to the lower and upper frequency limits of the sub-band, respectively. According to [40], SE enhances the fluctuations between voiced and unvoiced utterances, while SSC enhances the variations in the 2nd, 3rd, and 4th formants across phone transitions in the singing voice. Hence, it is reasonable to expect high temporal variance of these features for songs with a high syllable rate, and vice versa. Thus, these features are able to differentiate songs with high and low syllable rates. We would expect both very high and very low syllable rates to lead to a low intelligibility score, while rates in a range similar to that of speech should result in a high intelligibility score.

6. Word-Frequency Score: Songs which use common words have been shown to be more intelligible than those which use unusual or obscure words [18]. Hence, we calculate a word-frequency score for the lyrics of the songs as an additional feature. This is a non-acoustic feature that is useful in cases where the lyrics of the song are available. We calculate the word-frequency score using the wordfreq open-source toolbox [42], which provides estimates of the frequencies of words in many languages.

7. Tempo and Event Density: These two rhythmic features reflect how fast the beat and rhythm of the song are. Event density is defined as the average frequency of events, i.e., the number of note onsets per second. Songs with very fast beats and high event density are likely to be less intelligible than slower songs, since the listener has less time to process each event before the next one begins. We used the MIRToolbox [22] to extract these rhythmic features.

8. Mel-frequency cepstral coefficients (MFCCs): MFCCs approximate the human auditory system's response more closely than linearly spaced frequency bands [36]. MFCCs have proven to be effective features in problems related to singing voice analysis [41], and so were considered as a potential feature here as well. For our system, we selected the first 17 coefficients (excluding the 0th) as well as the deltas of those features, which proved empirically to be the best number of coefficients. The MFCCs are extracted from the original signal without separation, as they reflect how the whole song is perceived.

By extracting this set of features for an input file, we end up with a vector of 43 features to be used in estimating the intelligibility of the lyrics of the song.
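As a minimal sketch of how several of these descriptors could be computed with librosa and wordfreq, the code below assumes the separated vocals and accompaniment from the preprocessing step and ignores the vocal-segment masking; the parameter choices are illustrative, and HRR, tempo, and event density are omitted for brevity.

```python
import numpy as np
import librosa
from wordfreq import zipf_frequency

def acoustic_features(y_mix, vocals, accomp, sr, n_fft=2048):
    A = np.abs(librosa.stft(vocals, n_fft=n_fft))      # vocal magnitudes a_{n,k}
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)

    # VAR: energy of separated vocals over energy of the accompaniment.
    var = np.sum(vocals**2) / np.sum(accomp**2)

    # HFE (Eq. 3.1): mean over frames of the summed magnitude above 4 kHz.
    hfe = np.mean(np.sum(A[freqs >= 4000, :], axis=0))

    # HFC (Eq. 3.2): magnitudes weighted by the squared bin index.
    k = np.arange(A.shape[0])
    hfc = np.mean(np.sum((k**2)[:, None] * A, axis=0))

    # Syllable-rate proxies (Eqs. 3.3-3.4): temporal std of SE and SSC.
    band = (freqs >= 300) & (freqs <= 900)
    se_std = np.std(np.sum(A[band, :]**2, axis=0))
    band = (freqs >= 1200) & (freqs <= 4500)
    ssc = np.sum(freqs[band][:, None] * A[band, :], axis=0) \
          / (np.sum(A[band, :], axis=0) + 1e-10)
    ssc_std = np.std(ssc)

    # MFCCs 1-17 plus deltas, from the unseparated mix, averaged over frames.
    mfcc = librosa.feature.mfcc(y=y_mix, sr=sr, n_mfcc=18)[1:]
    mfcc_feats = np.concatenate([mfcc.mean(axis=1),
                                 librosa.feature.delta(mfcc).mean(axis=1)])

    return np.concatenate([[var, hfe, hfc, se_std, ssc_std], mfcc_feats])

def word_frequency_score(lyrics):
    """Mean Zipf frequency of the lyric words (higher = more common words)."""
    words = lyrics.lower().split()
    return float(np.mean([zipf_frequency(w, "en") for w in words])) if words else 0.0
```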

3.4 Building an acoustic model

We used the dataset and ground truth collected in our behavioral experiment to train a Support Vector Machine model to estimate the intelligibility of the lyrics. To categorize intelligibility into levels that would match a language student's fluency level, we divided our dataset into three classes:

- High Intelligibility: excerpts with a transcription accuracy greater than 0.66.
- Moderate Intelligibility: excerpts with a transcription accuracy between 0.33 and 0.66 inclusive.
- Low Intelligibility: excerpts with a transcription accuracy of less than 0.33.

Out of the 100 samples in our dataset, 43 are in the High Intelligibility class, 42 are in the Moderate Intelligibility class, and the remaining 15 are in the Low Intelligibility class. For this pilot study, we tried a number of common classifiers, including the Support Vector Machine (SVM), random forests, and k-nearest neighbors. Our trials for finding a suitable model led to using an SVM with a linear kernel, as it is an efficient, fast, and simple model which is suitable for this problem. Finally, as a preprocessing step, we normalize all the input feature vectors before passing them to the model to be trained.

Model Evaluation

Because this problem has not been addressed before in the literature, and it is not possible to perform evaluation against other methods, we based our evaluation on classification accuracy on the dataset. Given the relatively small number of samples in the dataset, we used leave-one-out cross-validation for evaluation. To evaluate the performance of our model, we compute the overall accuracy as well as the Area Under the ROC Curve (AUC). We scored an AUC of 0.71 and an accuracy of 66% with the aforementioned set of features and model. The confusion matrix from validating our model using leave-one-out cross-validation on our collected dataset is shown in Figure 3.7. The figure shows that the classifier is relatively more accurate in predicting high and moderate intelligibility than low intelligibility, which is often confused with the moderate class. Given that our findings are based on a relatively small set of excerpts with low intelligibility, the classifier was trained to work better on the high and moderate excerpts.
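A minimal sketch of this training and evaluation loop with scikit-learn follows; the feature matrix and scores are random placeholders standing in for the extracted features and the collected intelligibility scores.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import confusion_matrix

# Hypothetical placeholders: 100 excerpts, 43 features, scores in 0..1.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 43))
scores = rng.uniform(size=100)

# Bin the continuous scores into low (0), moderate (1), and high (2) classes.
labels = np.where(scores < 0.33, 0, np.where(scores <= 0.66, 1, 2))

# Normalize the features, then fit a linear-kernel SVM with leave-one-out CV.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
preds = cross_val_predict(model, X, labels, cv=LeaveOneOut())

print("accuracy:", (preds == labels).mean())
print(confusion_matrix(labels, preds))
```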

Figure 3.7: Confusion matrix of the SVM output.

Figure 3.8: Confusion matrices of the different genres (panels: Rock, Folk, R&B, and Jazz).

Following model evaluation on the complete dataset, we were interested in investigating how the model performs on different genres, specifically how it performs when tested on a genre that was not included in the training dataset. This indicates how the model generalizes to genres that were not present during training, as well as how changing genres affects classification accuracy. We performed an evaluation where we trained our model using 4 out of the 5 genres in our dataset and tested it on the 5th genre. The classification accuracy across the different genres is shown in Table 3.2.

| Genre | Classification Accuracy |
|-------|-------------------------|
| Pop/Rock | 60% |
| R&B | 55% |
| Classical | 70% |
| Folk | 55% |
| Jazz | 60% |

Table 3.2: Classification accuracy for different genres.

The results show variance in classification across genres. For example, Classical music receives higher accuracy, while genres such as Rhythm and Blues and Folk show lower accuracy. By analyzing the confusion matrices of each genre shown in Figure 3.8, we found that the confusion is mainly between the high and moderate classes. To review the impact of the different features on the classifier's performance, we examined which features have the biggest impact using the attribute ranking feature in Weka [48]. We found that several MFCCs contribute most to differentiating between the three classes, which we interpret as being because analyzing the signal in different frequency sub-bands incorporates perceptual information about both the singing voice and the background music. These were followed by the features reflecting the syllable rate of the song, because the singing rate can radically affect intelligibility. The Vocals to Accompaniment Music Ratio and High Frequency Energy followed in their impact on differentiating between the three classes. The features that had the least impact were the tempo and event density, which do not necessarily reflect the rate of singing.
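The leave-one-genre-out evaluation described above can be reproduced with scikit-learn's grouped cross-validation, as sketched below; the feature matrix, labels, and genre annotations are hypothetical placeholders.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical data: 100 excerpts, 43 features, one genre label per excerpt.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 43))
labels = rng.integers(0, 3, size=100)
genres = np.repeat(["Classical", "Folk", "Jazz", "Pop/Rock", "R&B"], 20)

# Each fold holds out one entire genre: train on 4 genres, test on the 5th.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
fold_acc = cross_val_score(model, X, labels, groups=genres, cv=LeaveOneGroupOut())
for genre, acc in zip(np.unique(genres), fold_acc):
    print(f"{genre}: {acc:.0%}")
```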

For further studies on the suitability of the features in classifying songs with very low intelligibility, the genre pool can be extended to include other genres with lower intelligibility, rather than being limited to the genres popular among students. Further studies can also address the feature selection and evaluation process: similar to the work in [46], deep learning methods may be explored to select the features which perform best, rather than hand-picking features, to find the most suitable set of features for this problem. It is also possible to extend the categorical approach of intelligibility levels to a regression problem, in which the system evaluates a song's intelligibility as a percentage. Similarly, certain ranges of the intelligibility score can be used to recommend songs to students based on their fluency level.

Chapter 4
Future Work

In its current state, our work covers a pilot approach for estimating the intelligibility of the singing voice. However, the broader problem is to recommend songs for students who are learning a foreign language. Hence, future work would include extended studies on estimating the intelligibility of the singing voice, analyzing the lyrics in terms of complexity and grammatical structure, and using the proposed scores to recommend music to language students based on their fluency level. Possible directions of future work on the intelligibility score include:

1. Investigating the effectiveness of using baseline acoustic and textual features to expand the current feature set.
2. Expanding the dataset to allow using approaches that require large-scale datasets. Using the validated MTurk labeling scheme, the challenge is to find a way to collect the ground truth lyrics and align them with the excerpts using methods from the literature, e.g. [15, 9].
3. With a large-scale dataset, we can investigate deep learning approaches, such as convolutional neural networks, that perform feature extraction instead of using hand-picked features.

Further work on lyrics complexity would investigate the grammatical structure of the lyrics and its correctness, to avoid recommending songs