Research Statement
Douglas Eck
Assistant Professor, Department of Computer Science, University of Montreal, Montreal, QC, Canada

Overview and Background

Since 2003 I have been an assistant professor in the University of Montreal Department of Computer Science. My training is in machine learning and cognitive science (Ph.D. in Computer Science and Cognitive Science, Indiana University, 2000). I design machine learning algorithms in the domain of music. I have also done research in computational models of musical rhythm and meter. Primarily I am interested in bringing knowledge about human music perception and production, as well as knowledge about music structure, to bear on the tasks of computational modeling and algorithm design in the domain of music. Representative tasks include large-scale music recommendation, low-dimensional representations of music audio for similarity, beat tracking and synchronization with music audio, expressive music performance, and music generation.

My research is multidisciplinary. In addition to publishing in my primary domain of machine learning (NIPS, Machine Learning, Neural Computation, Neural Networks), I have also published in psychology and cognitive science (Music Perception, Psychological Research), signal processing (IEEE Signal Processing, EURASIP Journal on Applied Signal Processing), music information retrieval (ISMIR), and music technology (Journal of New Music Research).

I have a long history of collaboration. I have active collaborations with fellow machine learning faculty at the University of Montreal (Yoshua Bengio, Balázs Kégl, Pascal Vincent). I work closely with psychologists and neuroscientists from the Brain, Music and Sound (BRAMS) research center. This includes projects between my students and myself and the labs of Caroline Palmer (McGill psychology), Isabelle Peretz (University of Montreal psychology) and Robert Zatorre (Montreal Neurological Institute). Via McGill's Centre for Interdisciplinary Research in Music Media and Technology (CIRMMT), I collaborate with the music information retrieval group of Ichiro Fujinaga.

I work closely with students and enjoy my role as lab leader. I currently advise or co-advise seven graduate students. Most of my publications are co-authored by one of my students. I offer my students a wide range of research options and try to connect them with other researchers in the field doing similar work. To illustrate, I currently have students working on projects such as sequence generation algorithms trained on music scores (symbolic sequence learning), expressive piano performance (cognitive psychology, motion capture), audio chord recognition (audio signal processing, matrix factorization), and music similarity and recommendation (dimensionality reduction, classification).

I also enjoy working with industry partners. Recently I participated in two technology transfer projects related to music recommendation. In the first project Dan Levitin (McGill psychology), Yoshua Bengio (University of Montreal Computer Science) and I developed the core technologies for the Montreal music recommendation site www.radiolibre.ca. In the second project I worked on a similar but larger-scale music recommendation project with Sun Labs, Boston (described below). The value of industry collaboration in the domain of machine learning is self-evident: one is forced to design algorithms that actually perform well on real-world problems. Additionally, such contacts are useful for finding jobs for students who do not want to continue in academia. For example, my Master's student Norman Casagrande is now a machine learning researcher at the popular web music recommendation site Last.fm in London. In the future I am interested in expanding ties with video game developers (context-aware music generation) and with developers of online communities such as Second Life (immersive musical environments).

Research Projects

Measuring music similarity and recommending music

Representative publications:
D. Eck, P. Lamere, T. Bertin-Mahieux, and S. Green. Automatic generation of social tags for music recommendation. In Neural Information Processing Systems Conference (NIPS) 20, 2007.
J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kégl. Aggregate features and AdaBoost for music classification. Machine Learning, 65(2-3):473-484, 2006.

Suppose that you want to measure how similar two songs are based solely on their acoustic content. If you could do this in a sophisticated manner it would be possible to recommend new music to listeners based on how their favorite songs sound. One place to start is to represent each song as a vector such that similar songs are near neighbors in this vector space. Similarity can then be measured using a vector distance measure such as cosine distance. Unfortunately it is difficult to construct a vectorial representation that is compact enough to yield meaningful vector distances and yet still descriptive of the music contained in the song.

The root of the challenge lies in the high dimensionality of music. Nearly 1 million integer values are required to represent 10 seconds of CD-quality stereo audio (44,100 16-bit samples per second for each of two stereo channels, i.e. 882,000 values per 10 seconds), yielding almost 16 million values for a three-minute song. It is of course possible to downsample and otherwise transform audio in order to reduce this dimensionality. For example, in speech recognition, Fourier-based transforms are commonly used to generate representations containing on the order of 200 values for every second of audio (e.g. 20 Mel-Frequency Cepstral Coefficients, or MFCCs, calculated at 100 ms intervals). But even then our three-minute song is encoded using 36,000 values, too many for effective vector-based similarity in this domain.
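To make the vector-space idea concrete, here is a minimal sketch that summarizes each song as the mean and standard deviation of 20 MFCCs and compares songs with cosine distance. The librosa and scipy calls and the file names are illustrative assumptions of this sketch, not the system described in this statement.

    # Minimal sketch: summarize each song as an MFCC statistics vector and
    # compare songs with cosine distance.  librosa/scipy are illustrative
    # choices, not tools named in this research statement.
    import numpy as np
    import librosa
    from scipy.spatial.distance import cosine

    def song_vector(path, n_mfcc=20):
        """Load audio and reduce it to a 2*n_mfcc summary vector."""
        y, sr = librosa.load(path, sr=22050, mono=True)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
        # Aggregate over time: mean and standard deviation per coefficient.
        return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

    def song_similarity(path_a, path_b):
        """Return similarity in [0, 1]; 1 means identical summary vectors."""
        return 1.0 - cosine(song_vector(path_a), song_vector(path_b))

    # Hypothetical file names, for illustration only.
    # print(song_similarity("taxman.wav", "yesterday.wav"))

Aggregating frame-level MFCCs into a single fixed-length vector is exactly the kind of drastic compression the paragraph above describes: it makes cosine distance computable, but it discards most of the temporal structure that listeners actually attend to.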

In addition to this challenge of high-dimensional inputs, there is also the problem of how people judge music similarity. First, we attend to different timescales when measuring similarity. For example, we might cluster music based on rhythm and tempo cues (e.g. "I want music to jog to") encoded in timescales between roughly 0.5 Hz and 5 Hz. Or we might instead pay attention to timbral cues (e.g. "I want to hear accordion music") encoded in much faster timescales. Second, our criteria for judging song similarity change based on context. For example, one might add the Beatles song "Taxman" to a jogging mix because it has an upbeat tempo. In another context one might select the same song for different reasons: perhaps the album Revolver is put into a dinner-party mix because it is good background for after-dinner conversation. For these reasons it seems clear that standard signal processing tricks are not enough to yield a good measure of music similarity.

I have worked on several projects which address issues related to music recommendation and music similarity. With Yoshua Bengio (Machine Learning, University of Montreal) and Dan Levitin (Music Psychology, McGill) I worked on a research project funded by an NSERC Idea to Innovation (I2I) industrial partnership grant. The goal of the project was to build a hybrid collaborative-filtering / content-based-filtering music recommender. Though our technology performed relatively well, the project was not pursued by the industrial partner past our one-year involvement.

With students James Bergstra and Norman Casagrande, I competed in two contests at MIREX, an international evaluation contest for music information retrieval (MIR). The contests involved predicting, respectively, the genre and the artist of audio waveforms. Our approach used a multi-class version of the ensemble learning algorithm AdaBoost. The model automatically ranked the audio features, allowing us to work with a large feature set and to iteratively discard unpredictive features when the model did not select them. The model finished first among 15 entries for genre recognition [2] and second among 10 entries for artist recognition. See [3] for more details.

In January 2007 I took a six-month leave from the University of Montreal to become a visiting professor at Sun Labs, Sun Microsystems, Boston, where I worked on Search Inside the Music (SITM), a project to build an industrial-scale music recommender. My role was to lead the machine learning part of the project. We devised a data mining method for collecting large numbers of music-related tags assigned by listeners to songs and artists on social music networks (e.g. Pandora or Last.fm). We then learned to predict these tags from audio. The vector of predicted tags can be used as a measure of inter-song or inter-artist similarity and can thus be used to recommend new or otherwise untagged music. In [10] we present results from experiments on a social tag database of millions of tags applied to 100,000 artists and an audio database of 90,000 songs spanning many of the more popular of these artists.

Future Work

One interesting area of further investigation is the use of unsupervised dimensionality reduction techniques to construct a low-dimensional music space from features extracted from digital audio. For example, Figure 1 shows the result of finding the shortest path through a nearest-neighbor graph. The graph was built using similarity distances from the SITM project [10]. A musical version of this example is available at www.iro.umontreal.ca/~eckdoug/tagspace.
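A minimal sketch of the nearest-neighbor-graph idea behind Figure 1 follows: connect each item to its closest neighbors under cosine distance over some feature vector (for example, a vector of predicted tags) and walk the shortest path between two artists. The networkx and scikit-learn calls, the toy random vectors, and the artist labels are assumptions of this illustration, not the SITM implementation.

    # Sketch: k-nearest-neighbor graph over per-artist feature vectors,
    # then a shortest path between two artists.  Toy data; not SITM code.
    import numpy as np
    import networkx as nx
    from sklearn.neighbors import NearestNeighbors

    def shortest_path_through_music_space(vectors, names, start, end, k=8):
        """vectors: (n_items, n_features) array; names: list of item labels."""
        nn = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(vectors)
        dist, idx = nn.kneighbors(vectors)

        graph = nx.Graph()
        for i, name in enumerate(names):
            for d, j in zip(dist[i][1:], idx[i][1:]):  # skip the self-neighbor
                graph.add_edge(name, names[j], weight=float(d))

        # Dijkstra over cosine-distance edge weights (raises NetworkXNoPath
        # if the k-NN graph happens to be disconnected).
        return nx.shortest_path(graph, source=start, target=end, weight="weight")

    # Toy usage with random vectors and hypothetical artist names.
    rng = np.random.default_rng(0)
    artists = [f"artist_{i}" for i in range(50)] + ["Beethoven", "The Prodigy"]
    X = rng.random((len(artists), 16))
    print(shortest_path_through_music_space(X, artists, "Beethoven", "The Prodigy"))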
I am also exploring alternative interfaces for browsing the space of similar songs, such as immersive environments in which listeners can change an audio stream by moving around a space constructed using graphical hints that indicate how certain movements will change the music.

Figure 1: Shortest path through tag space from Beethoven to the hardcore/industrial band The Prodigy, calculated using the SITM AdaBoost model (see the 2007 NIPS paper [10]).

Music sequence generation

Representative publications:
H. Jaeger and D. Eck. Can't get you out of my head: A connectionist model of cyclic rehearsal. In Modeling Communications with Robots and Virtual Humans, LNCS. Springer-Verlag, 2007. To appear.
J. Paiement, D. Eck, S. Bengio, and D. Barber. A graphical model for chord progressions embedded in a psychoacoustic space. In ICML '05: Proceedings of the 22nd International Conference on Machine Learning, pages 641-648, New York, NY, USA, 2005. ACM Press.
D. Eck and J. Schmidhuber. Finding temporal structure in music: Blues improvisation with LSTM recurrent networks. In H. Bourlard, editor, Neural Networks for Signal Processing XII, Proceedings of the 2002 IEEE Workshop, pages 747-756, New York, 2002. IEEE.

Despite significant advances in our understanding of how people perceive and create music, we still suffer from a relatively primitive set of algorithms for working with music. There are indeed good synthesis algorithms for the realistic simulation of acoustic and electronic instruments. There are also effective methods for editing, combining and embellishing pieces of music. What is lacking, I believe, are algorithms for generating interesting new sequences of music, either alone as software composers or in real-time collaboration with musicians as software improvisers. Automatic music composition algorithms like these would serve two audiences. They would improve the tools already available to tech-savvy musicians while at the same time involving a new group of users: those who enjoy working with music but who lack the music-theoretic skills to compose. In addition, such tools may be useful for generating music based on the changing context of a video game or film. Though all of these areas have received considerable attention, especially in traditional artificial intelligence (for example the work of David Cope [5]), fundamental challenges remain.

Music sequences are difficult to learn using standard time-series analysis techniques. This is due in part to the fact that musical structure is elaborated over long time spans and is hierarchical, making it difficult or impossible to learn from local information. As an example, compare the phrase "John ate the [?]" to the musical sequence "B C E [?]". In the linguistic case, the words "John ate the" constrain the possible values of the next word [?]. In the musical case, the nearby notes B, C and E do little to constrain what can come next. This is not to say that music lacks constraints. In music the constraints come from elaboration and repetition within the context of musical structure. Fortunately (for us), listeners are able to find this structure easily. For example, children old enough to jump rope can learn songs containing repeated variations of a musical phrase. Moreover, the phrase can be relatively long, usually spanning several measures of music. Unfortunately (for standard sequence learning tools) this illustrates that music is elaborated over long time spans and at different timescales in parallel. Furthermore, wholesale repetition of sequences, or of modified sequences, is common. Together, this leads to a difficult sequence learning task for standard time-series models such as recurrent neural networks, hidden Markov models, ARMA models, etc.

Figure 2: A piano roll representation showing note probabilities. Densities are taken from the transitions learned by an LSTM recurrent network trained on folk music. (Graphic from a manuscript currently in preparation with graduate student Jasmin Lapalme.)

One promising method for composing music automatically is to use a recurrent neural network (RNN) to capture the long-timescale regularities that constitute musical style. An RNN is a neural network with self-connected units. These self-connections allow an RNN to represent context and thus to discover correlations that span long time lags in a sequence. Unfortunately, the gradient descent learning used in traditional RNNs leads to exponential decay of vital error information [1], greatly limiting in practice the kinds of sequence learning problems RNNs can handle. A potential solution to this problem lies in the hybrid RNN Long Short-Term Memory (LSTM) [12]. In previously published research done with one of LSTM's developers, I used LSTM to create an algorithm for learning to improvise music in a bebop jazz style [9]. Unlike competing methods, my LSTM composer could learn to be a stable generator of stylistically correct music. (A minimal code sketch of this kind of sequence-learning setup appears at the end of this section.)

Another promising method lies in generative graphical models. In recent work with PhD student Jean-François Paiement and co-supervisor Samy Bengio (Google) we developed a hierarchical graphical model that can generate new chord voicings for jazz standards and can also generate melodies from chord voicings [17, 16]. By modeling distance patterns, this approach can also be extended to musical rhythm [18].

Finally, I have been collaborating with Herbert Jaeger from Jacobs University Bremen. We have been applying his Echo State Networks to the task of music sequence generation. Early experiments on the learning and rehearsal of cyclic patterns were promising [13], but much more work is needed to know whether this approach will scale from short cyclic patterns to more deeply structured music sequences.

Future Work

All of the approaches I have tried have yielded partial success, but no machine learning algorithm has been a clear winner. (This is not surprising given the complexity of music.) One direction I continue to explore is to represent metrical hierarchy either as a feature in the input sequence or in the structure of the learner. By having access to such structure, a model is able to know where to look in the sequence for repetition boundaries and, more generally, for correlations spanning long timescales.

Also, in certain domains it may be possible to solve a simpler problem than the general problem of composing novel musical sequences by example. In the case of video games, for example, a simpler yet still useful sequence generation task might be to adapt existing music compositions to meet specific contexts in the game. That is, the goal of the model would not be to generate new melodies to fit a situation but rather to morph the timbre, harmonics and performance dynamics of known compositions to fit a particular game state (e.g. "very dangerous"). This would require representing a music composition such that variations can be generated by adjusting parameters. One could then learn a mapping between those composition parameters and the changing state of the video game.
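As promised above, here is a minimal sketch of the next-note-prediction setup that recurrent sequence models of this kind use. PyTorch and the random toy sequences are assumptions of this illustration; this is not the published blues-improvisation model, which used an LSTM trained on real melodies and chord structure.

    # Minimal sketch: LSTM trained to predict the next note in a symbolic
    # note sequence.  Toy random data stands in for real melodies.
    import torch
    import torch.nn as nn

    class NextNoteLSTM(nn.Module):
        def __init__(self, n_notes=128, embed=64, hidden=256):
            super().__init__()
            self.embed = nn.Embedding(n_notes, embed)
            self.lstm = nn.LSTM(embed, hidden, batch_first=True)
            self.out = nn.Linear(hidden, n_notes)

        def forward(self, notes):                # notes: (batch, time) int64
            h, _ = self.lstm(self.embed(notes))  # (batch, time, hidden)
            return self.out(h)                   # logits over the next note

    model = NextNoteLSTM()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    # Toy training data: random note indices standing in for real melodies.
    seqs = torch.randint(0, 128, (32, 65))
    inputs, targets = seqs[:, :-1], seqs[:, 1:]

    for step in range(100):
        logits = model(inputs)
        loss = loss_fn(logits.reshape(-1, 128), targets.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

After training, music can be generated by sampling from the predicted distribution one step at a time and feeding each sampled note back as input; the difficulty discussed above is that such local, step-by-step prediction does not by itself capture long-timescale repetition structure.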

Expression: Realistic Performance Timing and Dynamics

Representative publications:
D. Eck. Beat tracking using an autocorrelation phase matrix. In Proceedings of the 2007 International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1313-1316. IEEE Signal Processing Society, 2007.
D. Eck. Finding long-timescale musical structure with an autocorrelation phase matrix. Music Perception, 24(2):167-176, 2006.
D. Eck. Finding downbeats with a relaxation oscillator. Psychological Research, 66(1):18-25, 2002.

What is the difference between a musical score and a musical performance? With the growth of computer-generated music, this question is of practical as well as theoretical importance. Almost anyone who listens to a computer play a musical score notices how flat and wooden it sounds. The challenge here lies not in the ability of the computer to generate realistic musical instrument timbres. Rather, it concerns how to generate realistic timing and dynamics. That is, much of what makes a performance a good one lies not in which notes are played, but rather in when and how they are played.¹ These aspects of performance have been researched by cognitive psychologists for nearly 20 years (see [19] for a slightly dated but still relevant review). However, it has proved difficult to quantify expressive performance such that machines can generate convincing performances. See [20] for a review of recent AI and machine learning attempts.

Expressive music performance is a new area of research for me. However, I have a long background in working with the related task of synchronizing in real time with music. In work that extended my dissertation, I introduced a model that uses the dynamics of continually spiking neurons to find beats [8]. The benefits of my approach are seen when large numbers of oscillating neurons are used to sample multiple levels of the metrical hierarchy. In this case, the fact that the model neurons exhibit so-called relaxation oscillation becomes important. Relaxation oscillators are optimally fast at synchronizing when coupled together in groups. This leads to rapid and stable phase-locking with a music-like input, even when dozens or hundreds of neurons are employed. For example, see Figure 3, which displays an array of FitzHugh-Nagumo neurons synchronizing with a metronome and with a performance of the Beatles song Yesterday played on a MIDI-enabled piano.

Figure 3: Array of 25 FitzHugh-Nagumo spiking neurons (single waveform at left) synchronizing with a metronome (middle) and a performance of the Beatles' Yesterday (right) played on a MIDI-enabled piano.

More recently I developed an algorithm that extends autocorrelation-based meter detection [4] to preserve the phase information necessary for beat tracking [6, 7]. The core data structure of the algorithm is a phase-by-period matrix called an autocorrelation phase matrix (APM); see Figure 4. The APM encapsulates important long-timescale dependencies in a performance. By limiting search to those lags suggested by the APM, it is possible to implement on-line beat tracking and improvisation.

Figure 4: Autocorrelation phase matrix (APM) for a Cha-Cha-Cha song. The APM is useful for beat tracking and for finding deep structure (metrical hierarchy) in music audio.

Finally, I have made advances in the related task of extracting note onsets from digital audio. Advances in this domain will make it possible to extract performance data directly from audio rather than from MIDI recordings, which require a MIDI-enabled instrument. In work with student Alexandre Lacoste, I developed a note onset detection algorithm that uses a supervised learning step to remove noise [14]. Two variants of the algorithm finished 1st and 2nd out of nine entries in the Audio Onset Detection contest at MIREX 2005 [15].

¹ For those who are not convinced, please compare two performances of a Chopin étude: one (www.iro.umontreal.ca/~eckdoug/performance/deadpan.mp3) is played from the score by a computer, with no performance dynamics or pedaling, while the other (www.iro.umontreal.ca/~eckdoug/performance/expressive.mp3) was played by a professional pianist. Note that the same score was used to generate both performances and that both were stored in MIDI format. The only differences are changes in timing, note velocities and the use of the pedal. For most listeners the difference between the two versions is vast.
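To make the phase-by-period structure of the APM described above concrete, here is a short sketch. The formulation (autocorrelation at each candidate period, binned by the phase of each sample within that period) is one plausible reading of the cited papers applied to a toy onset-strength signal; it is not the published algorithm.

    # One plausible formulation of a phase-by-period autocorrelation matrix:
    # for each candidate period p, accumulate products of samples that are
    # exactly p apart, binned by their phase (position modulo p).
    import numpy as np

    def autocorrelation_phase_matrix(onset_strength, max_period):
        n = len(onset_strength)
        apm = np.zeros((max_period, max_period + 1))  # rows: phase, cols: period
        for p in range(1, max_period + 1):
            for t in range(n - p):
                apm[t % p, p] += onset_strength[t] * onset_strength[t + p]
        return apm

    # Toy onset-strength signal with a pulse every 8 samples.
    signal = np.zeros(256)
    signal[::8] = 1.0
    apm = autocorrelation_phase_matrix(signal, max_period=32)
    best_period = apm.sum(axis=0)[1:].argmax() + 1  # strongest lag (8 for this toy signal)
    best_phase = apm[:, best_period].argmax()       # beat phase within that lag (0 here)
    print(best_period, best_phase)

The point of keeping the phase dimension, as in the prose above, is visible even in this toy: a plain autocorrelation would report only that lag 8 is strong, whereas the matrix also says where within each 8-sample cycle the beats fall.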

Future Work

In my current research I am focusing on two music performance issues. The first is the question of exactly how expressive timing and dynamics correlate with meter and grouping structure in music. The second is the question of how expressive timing and dynamics effects are achieved by performers. Currently I am in the process of collecting performance data from a large number of pianists recorded on my lab's Bösendorfer concert grand piano. With student Stanislaus Lauly I am working on a number of algorithms for modeling expressive performance. To date, one of the most promising is a sequential version of the Deep Belief Network (DBN) [11], an algorithm composed of multiple connected Restricted Boltzmann Machines (RBMs) trained using contrastive divergence.
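As background for the RBM building block mentioned above, here is a minimal numpy sketch of one contrastive-divergence (CD-1) update for a single binary RBM. It is a generic textbook step under assumed toy dimensions (88 visible units, 50 hidden units), not the sequential DBN variant under development.

    # Minimal CD-1 update for a single binary RBM (generic textbook step,
    # not the sequential DBN variant mentioned above).
    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd1_update(W, b_vis, b_hid, v0, lr=0.01):
        """One contrastive-divergence step on a batch of binary visible vectors v0."""
        # Positive phase: sample hidden units given the data.
        p_h0 = sigmoid(v0 @ W + b_hid)
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
        # Negative phase: reconstruct visibles, then recompute hidden probabilities.
        p_v1 = sigmoid(h0 @ W.T + b_vis)
        p_h1 = sigmoid(p_v1 @ W + b_hid)
        # Gradient approximation: data statistics minus reconstruction statistics.
        batch = v0.shape[0]
        W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / batch
        b_vis += lr * (v0 - p_v1).mean(axis=0)
        b_hid += lr * (p_h0 - p_h1).mean(axis=0)

    # Toy usage: 88 visible units (one per piano key), 50 hidden units.
    W = rng.normal(0, 0.01, size=(88, 50))
    b_vis, b_hid = np.zeros(88), np.zeros(50)
    v0 = (rng.random((16, 88)) < 0.1).astype(float)  # a random batch of "chords"
    cd1_update(W, b_vis, b_hid, v0)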

References

[1] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157-166, 1994.
[2] J. Bergstra, N. Casagrande, and D. Eck. Genre classification: Timbre- and rhythm-based multiresolution audio classification. MIREX genre classification contest, 2005. URL http://www.music-ir.org/evaluation/mirex-results.
[3] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kégl. Aggregate features and AdaBoost for music classification. Machine Learning, 65(2-3):473-484, 2006.
[4] J. Brown. Determination of meter of musical scores by autocorrelation. Journal of the Acoustical Society of America, 94:953-957, 1993.
[5] D. Cope. Computers and Musical Style. A-R Editions, Inc., Madison, Wisconsin, 1991.
[6] D. Eck. Beat tracking using an autocorrelation phase matrix. In Proceedings of the 2007 International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1313-1316. IEEE Signal Processing Society, 2007.
[7] D. Eck. Finding long-timescale musical structure with an autocorrelation phase matrix. Music Perception, 24(2):167-176, 2006.
[8] D. Eck. Finding downbeats with a relaxation oscillator. Psychological Research, 66(1):18-25, 2002.
[9] D. Eck and J. Schmidhuber. Finding temporal structure in music: Blues improvisation with LSTM recurrent networks. In H. Bourlard, editor, Neural Networks for Signal Processing XII, Proceedings of the 2002 IEEE Workshop, pages 747-756, New York, 2002. IEEE.
[10] D. Eck, P. Lamere, T. Bertin-Mahieux, and S. Green. Automatic generation of social tags for music recommendation. In Neural Information Processing Systems Conference (NIPS) 20, 2007.
[11] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006. doi:10.1162/neco.2006.18.7.1527.
[12] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735-1780, 1997.
[13] H. Jaeger and D. Eck. Can't get you out of my head: A connectionist model of cyclic rehearsal. In Modeling Communications with Robots and Virtual Humans, LNCS. Springer-Verlag, 2007. To appear.
[14] A. Lacoste and D. Eck. A supervised classification algorithm for note onset detection. EURASIP Journal on Applied Signal Processing, 2007(ID 43745):1-13, 2007.

[15] A. Lacoste and D. Eck. Onset detection with artificial neural networks. MIREX note onset detection contest, 2005. URL http://www.music-ir.org/evaluation/mirex-results.
[16] J. Paiement, D. Eck, and S. Bengio. A probabilistic model for chord progressions. In Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR 2005), pages 312-319, London: University of London, 2005.
[17] J. Paiement, D. Eck, S. Bengio, and D. Barber. A graphical model for chord progressions embedded in a psychoacoustic space. In ICML '05: Proceedings of the 22nd International Conference on Machine Learning, pages 641-648, New York, NY, USA, 2005. ACM Press.
[18] J. Paiement, Y. Grandvalet, S. Bengio, and D. Eck. A generative model for rhythms. NIPS 2007 Workshop on Music, Brain and Cognition, 2007.
[19] C. Palmer. Music performance. Annual Review of Psychology, 48:115-138, 1997. URL http://upload.mcgill.ca/spl/annrev97.pdf.
[20] G. Widmer and W. Goebl. Computational models of expressive music performance: The state of the art. Journal of New Music Research, 33(3):203-216, 2004.