Sequence generation and classification with VAEs and RNNs

Jay Hennig*, Akash Umakantha*, Ryan Williamson* (Carnegie Mellon University, Pittsburgh, PA; *equal contribution)
Correspondence to: Jay Hennig <jhennig@andrew.cmu.edu>, Akash Umakantha <aumakant@andrew.cmu.edu>, Ryan Williamson <rcw1@andrew.cmu.edu>.
Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, JMLR: W&CP. Copyright 2017 by the author(s).

1. Introduction

Variational autoencoders (VAEs) (Kingma & Welling, 2013) are a popular approach for performing unsupervised learning that can also be used as generative models. VAEs model the data distribution as a nonlinear transformation of unobserved latent variables. Inference of the latent variables given the observed data is performed via a recognition (encoding) model, while the transformation from latent variables to data is modeled by a generating (decoding) model. The distribution of latent variables is encouraged during training to take a simple form, such as a normal distribution, making the data generating process straightforward. This approach has been shown to be successful at generating complicated forms of data such as digits and faces (Kingma & Welling, 2013). However, when the data is multi-modal, VAEs do not provide an explicit mechanism for specifying which mode a generated sample should come from. For example, we may wish to ensure that we generate a 2 and not a 4. One possible approach is to model the latent distribution as a mixture distribution such as a mixture of Gaussians (Doersch, 2016). However, this approach is difficult to train and often not successful in practice (Makhzani et al., 2015).

In this paper we introduce the Classifying VAE, in which we train a classifier to detect the mode of each data point simultaneously with training the recognition and generation models. By modeling the probability that a data point belongs to each mode as a Logistic Normal distribution, we can use the reparameterization trick of Stochastic Gradient Variational Bayes to sample from this distribution during training.

In the case of generating sequence data, VAEs can be successfully combined with recurrent neural networks (RNNs). The combination of VAEs and RNNs forms a class of models we refer to broadly as variational RNNs (vRNNs) (Chung et al., 2015; Boulanger-Lewandowski et al., 2012; Fraccaro et al., 2016; Bayer & Osendorfer, 2014). When each sequence comes from a discrete class, we can similarly train a Classifying vRNN.

We demonstrate that our models are useful for generating data from one of several discrete modes by training them on polyphonic music data. In music, segments of a piece conform to a particular key, with different notes being appropriate in different keys. When generating music, therefore, it is important to be able to generate music from only one key at a time. Previous approaches often preprocess the training data to be in only one of two keys (Boulanger-Lewandowski et al., 2012), compared to the up to 24 keys present in the original music. Even in this simplified setting, we show that models such as VAEs and vRNNs are prone to generating samples that do not stay in key, resulting in dissonance and poor-sounding music. By contrast, we show that our Classifying VAE and Classifying vRNN can generate samples that are as in key as the original training data, even when the training data includes songs in 18 different keys.
2. Background & Related Work

Modeling sequence data. The standard application of RNNs to sequence data is to have them model the probability distribution over sequences as follows (Bayer & Osendorfer, 2014):

    p(x_{1:T}) = \prod_{t=1}^{T} p(x_t \mid x_{1:t-1})

The output of the RNN at each timestep t is then interpreted as the parameters of the term p(x_t | x_{1:t-1}).
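To make this factorization concrete, below is a minimal sketch (not the authors' architecture) of an RNN whose output at each timestep parameterizes p(x_t | x_{1:t-1}) as 88 independent Bernoulli probabilities over piano-roll notes; the use of Keras, the layer sizes, and the training call are illustrative assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

D = 88  # piano-roll dimensionality (88 piano keys)

# The LSTM state summarizes x_{1:t-1}; the sigmoid outputs at step t are the
# Bernoulli parameters of p(x_t | x_{1:t-1}) for each of the D notes.
rnn = keras.Sequential([
    layers.Input(shape=(None, D)),
    layers.LSTM(128, return_sequences=True),
    layers.TimeDistributed(layers.Dense(D, activation="sigmoid")),
])
rnn.compile(optimizer="adam", loss="binary_crossentropy")

# Training pairs are shifted by one step so the model predicts x_t from x_{1:t-1}:
# rnn.fit(x[:, :-1, :], x[:, 1:, :], epochs=10, batch_size=32)
```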

Boulanger-Lewandowski et al. (2012) combined an RNN with a restricted Boltzmann machine (RBM) in order to model correlations amongst the outputs at each timestep. Their paper established a baseline for music generation by reporting the performance of a variety of simple models such as GMMs, RBMs, HMMs, and n-grams.

More recently, there has been a lot of interest in the variational Bayes approach for training deep generative models. The original paper by Kingma & Welling (2013), though not related to modeling sequence data, introduced the Stochastic Gradient Variational Bayes (SGVB) estimator and the variational autoencoder (VAE). The VAE uses a variational approximation to learn an encoder model, which maps input data to a stochastic latent representation, and a decoder model, which maps this latent representation back into the data space. Mathematically, their model seeks to maximize the following estimate of the variational lower bound:

    L(\theta, \phi; x^{(i)}) = -D_{KL}(q_\phi(z \mid x^{(i)}) \| p_\theta(z)) + E_{q_\phi(z \mid x^{(i)})}[\log p_\theta(x^{(i)} \mid z)]

Their key insight was what they called the reparameterization trick. The problem with classical backpropagation in a deep generative model framework is that it requires calculating gradients of an expectation taken with respect to a random variable z, which is difficult in general. The reparameterization trick rewrites the random variable as a deterministic function of some other random variable \epsilon. The objective is then rewritten as the empirical lower bound:

    \tilde{L}(\theta, \phi; x^{(i)}) = -D_{KL}(q_\phi(z \mid x^{(i)}) \| p_\theta(z)) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x^{(i)} \mid z^{(i,l)})

where z^{(i,l)} = g_\phi(\epsilon^{(i,l)}, x^{(i)}) and \epsilon^{(i,l)} \sim p(\epsilon). When z \sim N(\mu(x), \sigma^2(x)), the KL term has the closed form:

    D_{KL}(q_\phi(z \mid x^{(i)}) \| p_\theta(z)) = -\frac{1}{2} \sum_{j=1}^{J} \left(1 + \log \sigma_j^2(x) - \mu_j(x)^2 - \sigma_j^2(x)\right)
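As a concrete illustration of the reparameterization trick and the closed-form KL term above, here is a small numpy sketch assuming a diagonal Gaussian posterior q_\phi(z | x); the function names are ours, not the paper's.

```python
import numpy as np

def sample_z(mu, log_var, rng=np.random.default_rng()):
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(q(z|x) || N(0, I)) for a diagonal Gaussian q."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

mu, log_var = np.zeros(4), np.zeros(4)      # toy posterior parameters
z = sample_z(mu, log_var)                   # differentiable in (mu, log_var)
print(kl_to_standard_normal(mu, log_var))   # 0.0 when q already equals the prior
```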
Recent papers have applied this variational perspective to modeling sequence data. Bayer & Osendorfer (2014) developed a model called STORN, which introduced stochastic latent variables to an RNN model, fit with SGVB. The sampled latent variables are assumed to be independent over time. Figure 1 shows a diagram of the STORN model. Next, Chung et al. (2015) developed what they called a Variational RNN (VRNN), which extended STORN so that the latent variables depend on the sequence history and therefore evolve over time. Finally, work by Fraccaro et al. (2016) attempted to merge the RNN and state-space model perspectives. Their approach, which they call a Stochastic RNN (SRNN), explicitly separates the stochastic and deterministic components of the model.

Figure 1. The generative models for the VAE (Kingma & Welling, 2013) and vRNN (Bayer & Osendorfer, 2014; Chung et al., 2015) in black, with the additions of the Classifying VAE and Classifying vRNN shown in blue. z are continuous latent variables, while w specifies the class of X (in the Classifying VAE) or the class of a sequence X_1, ..., X_T (in the Classifying vRNN).

Application to music. In Western music, there are twelve possible pitch classes one can play (A, Bb, B, C, Db, D, Eb, E, F, Gb, G, Ab), and most songs are written using only a subset of these. The key of a song (or a part of a song) refers to the central tonic note (e.g., C, D, etc.), along with its mode (e.g., major or minor). For example, music written in the key of C major will tend to use only notes without sharps or flats (A B C D E F G), with C being the tonic note that most melodies end on. Music in the key of C minor, by contrast, will have the same tonic note (C) but a different overall set of notes (C D Eb F G Ab Bb). In this paper we consider 24 possible keys, referring to each of the twelve tonic notes in either the major or minor mode.

In this paper we model music data as a series of 88-dimensional binary vectors, X_t ∈ {0, 1}^{88}, where the j-th entry, X_t^j, represents whether or not the j-th key on an 88-key piano was played at time t. We refer to X = {X_t : t = 1, ..., T} as a music sequence with length T.

If X is in the key of C major, from a probabilistic standpoint this means P(X_t^j) ≈ 0 if the note referred to by X_t^j is not part of the key of C major. While some songs do occasionally change keys, we assume that the key is constant as long as T is not too large.

3. Methods

Classifying Variational Autoencoder. The joint distribution we consider is

    P(X, z, w) = P(X \mid z, w) P(z \mid w) P(w)

where X is the observed data, and z and w are unobserved latent variables. We suppose that w is a vector of dimension d, specifying the probabilities that X comes from one of d distinct classes.

We now follow the main idea of the variational autoencoder, which begins with constructing a variational lower bound on the log of the marginal likelihood

    P(X) = \int \int P(X \mid z, w) P(z \mid w) P(w) \, dz \, dw.

We can write the following equality:

    \log P(X) - D[Q(z, w \mid X) \| P(z, w \mid X)] = L(X) = E_{z, w \sim Q}[\log P(X \mid z, w)] - D[Q(z, w \mid X) \| P(z, w)]

where D is the KL divergence (which is always non-negative), and L(X) is the evidence lower bound, or ELBO. Applying the chain rule to the KL terms lets us rewrite the left-hand side as:

    \log P(X) - D[Q(z \mid w, X) \| P(z \mid w, X)] - D[Q(w \mid X) \| P(w \mid X)]

We can similarly rewrite the right-hand side:

    E_{z, w \sim Q}[\log P(X \mid z, w)] - D[Q(z \mid w, X) \| P(z \mid w)] - D[Q(w \mid X) \| P(w)]

Above, P(z | w) and P(w) are priors, while Q is an arbitrary distribution. Critically, Q will be parameterized by a neural network, and we will optimize Q (with gradient descent) so as to maximize the ELBO, thereby maximizing the marginal likelihood. To keep our lower bound on the marginal likelihood tight, we must ensure the KL divergences on the left-hand side are small. Typically, we don't have access to P(z | X, w). However, we suppose that when training we do have access to the true w. So instead of minimizing D[Q(w | X) || P(w | X)], we can minimize the cross-entropy between Q(w | X) and the true w. In other words, minimizing this portion of the loss will encourage Q(w | X) to classify w. By simultaneously minimizing the cross-entropy loss while maximizing the ELBO, we keep the lower bound on the data likelihood tight while also maximizing it. As with a standard VAE, we use the Stochastic Gradient Variational Bayes (SGVB) estimator of L(X), whereby we draw samples of w ~ Q(w | X) and z ~ Q(z | w, X) for each minibatch during training.

One limitation of the VAE is that the latent variables z and w cannot be discrete; otherwise we could not perform backpropagation on the parameters of the neural network that specifies Q. In this case, we wish for w to be the probabilities that X belongs to each of the d classes. We therefore parameterize Q(w | X) as a Logistic Normal distribution. Samples w ~ Q(w | X) = LN(μ, Σ) have the property that 0 ≤ w_i ≤ 1 for i = 1, ..., d and \sum_{i=1}^{d} w_i = 1, so w can be interpreted as our estimated probability that X belongs to each of the d classes. Critically, to use SGVB, we must be able to apply the reparameterization trick and generate samples of z and w as differentiable, deterministic functions of X and some auxiliary variables ε with independent marginals. (It is important to note that samples from the Dirichlet, another option for Q(w | X), cannot be trivially written in this form.) A common choice for the distribution of z is a Normal distribution, to which the reparameterization trick applies. For w, samples from the Logistic Normal distribution can be generated deterministically from a sample from a Normal distribution with the same parameters. Specifically, if y ~ N(μ, Σ) is a sample from a Normal distribution with y ∈ R^{d-1}, then w ~ LN(μ, Σ) is a sample from a Logistic Normal if we set

    w_j = \frac{e^{y_j}}{1 + \sum_{k=1}^{d-1} e^{y_k}} \quad \text{for } j = 1, ..., d-1, \qquad w_d = \frac{1}{1 + \sum_{k=1}^{d-1} e^{y_k}}.
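A minimal numpy sketch of the Logistic Normal reparameterization just described (draw y from a Gaussian, then map it onto the simplex); the function name and the diagonal covariance are our own choices.

```python
import numpy as np

def sample_logistic_normal(mu, sigma, rng=np.random.default_rng()):
    """Draw w ~ LN(mu, diag(sigma^2)) by transforming a (d-1)-dim Gaussian sample.

    w_j = exp(y_j) / (1 + sum_k exp(y_k)) for j < d, and w_d = 1 / (1 + sum_k exp(y_k)),
    so w lies on the d-dimensional simplex and the map from y is differentiable.
    """
    y = mu + sigma * rng.standard_normal(mu.shape)   # reparameterized Gaussian sample
    denom = 1.0 + np.sum(np.exp(y))
    return np.append(np.exp(y) / denom, 1.0 / denom)

w = sample_logistic_normal(np.zeros(1), np.ones(1))  # d = 2 classes (e.g., two keys)
print(w, w.sum())                                    # entries in (0, 1), summing to 1
```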
4. Experiments

Music generation. We apply our method to generate sequences of polyphonic music in the form of piano-roll data. For training data we use the entire corpus of 382 four-part harmonized chorales by J.S. Bach ("JSB Chorales"), obtained from the Python package music21 (Cuthbert & Ariza, 2010). This corpus was converted to piano-roll notation, a binary matrix denoting when each of 88 notes (from A0 to C8) was played over time (with time divided into quarter or eighth notes, depending on the data set).

One of the most popular algorithms for detecting the key of a musical sequence is the Krumhansl-Schmuckler algorithm (Krumhansl, 2001). This algorithm finds the proportion of each pitch class present in a sequence and compares this with the proportions expected in each key. We used the implementation of this algorithm in music21 to establish the ground-truth key of each musical sequence in our training data. The JSB Chorales corpus includes songs in 18 different keys. When classifying the key, we treat major and minor scales that have the same notes (e.g., C major and A minor) as the same key, resulting in 10 distinct key classes. Additionally, for comparison with previous work on this data set, we also present results for models trained on songs transposed so that each song is in either C major or C minor. Finally, we perform further experiments on the Piano-midi.de data set, with songs transposed to either C major or C minor, as in Boulanger-Lewandowski et al. (2012).
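For reference, here is a short sketch of key detection with music21, whose analyze('key') method implements a Krumhansl-Schmuckler-style algorithm; the chorale chosen and the relative-key merging shown are illustrative, not necessarily the authors' exact preprocessing.

```python
from music21 import corpus

chorale = corpus.parse('bach/bwv66.6')      # one of the chorales bundled with music21
detected = chorale.analyze('key')           # Krumhansl-Schmuckler key estimate
print(detected.tonic.name, detected.mode)   # e.g., "F# minor"

# Merging relative major/minor pairs into one class (e.g., C major == A minor):
key_class = detected if detected.mode == 'major' else detected.relative
print('key class:', key_class.tonic.name, key_class.mode)
```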

Implementation of the VAE and Classifying VAE. To model temporal data with a VAE and Classifying VAE, we considered two options. First, we may consider each timestep in isolation, autoencoding X_t to X_t via z_t and w_t. (Note that in all models, w_t is constant for all t referring to the same song.) Our second option is to autoencode X_t to X_{t+1}. We found that the former approach led to samples with better musical qualities and lower cross-validated reconstruction error.

For the Classifying VAE, we assume the following priors on z and w:

    P(z \mid w) = P(z) = N(0, I)
    P(w) = LN(0, I)

For Q, we use:

    Q(w \mid X) = LN(\mu_w(X), \mathrm{diag}[\sigma_w^2(X)])
    Q(z \mid X, w) = N(\mu_z(X, w), \mathrm{diag}[\sigma_z^2(X, w)])

where μ_w and σ_w² are implemented as the outputs of a neural network (the "classifier"), and μ_z and σ_z² as the outputs of a different neural network (the "encoder"). Note that the outputs of the encoder are a function of X as well as w, so when training we must draw a sample w ~ Q(w | X) as described above before computing μ_z(X, w) and σ_z²(X, w). Note also that the true w is used during training only in the loss function. Finally, we assume P(X | w, z) is an 88-dimensional Bernoulli distribution with probabilities p(w, z) given by the output of a neural network (the "decoder"). For the classifier, encoder, and decoder networks, we use multilayer perceptrons (one for each network) with one hidden layer and ReLU activations. In practice, we found that classifying the key of a musical sequence is not difficult, even when the training data includes songs in 18 different keys. To implement the standard VAE, we ignore w in all equations above and use only an encoder and decoder network, as in Kingma & Welling (2013).

Implementation of the vRNN and Classifying vRNN. For our standard vRNN, we implement a network similar to STORN (Bayer & Osendorfer, 2014). In this case, the encoder and decoder networks are each replaced with an LSTM with ReLU activations, followed by a single dense layer. The Classifying vRNN is similar except for the addition of a classifier, implemented as in the Classifying VAE. Critically, we classify the key using an entire sequence of X. We leave the sequence length as a hyperparameter to be chosen. A sample w ~ Q(w | X) is then fed into the encoder and decoder networks at each timestep of the sequence.

Generating music samples. When generating samples with a VAE, typically one would draw z_t ~ N(0, I), pass this value through the decoder to get p(z_t), and then sample X_t from those probabilities. However, we found that the generated samples had better musical qualities if we instead drew z_t ~ Q(z | X_{t-1}). In other words, we sample z_t from the encoding of X_{t-1}, the previously generated sample. We used this approach when generating samples for all models.
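The generation procedure just described can be sketched as follows; `encoder` and `decoder` are stand-ins for the trained Classifying VAE networks and are assumptions of this sketch, not the authors' code.

```python
import numpy as np

def generate(seed_frame, encoder, decoder, w, T=32, rng=np.random.default_rng()):
    """Generate T piano-roll frames, drawing z_t from the encoding of the
    previously generated frame. `w` is the class sample, fixed for the sequence."""
    x_prev, frames = seed_frame, []
    for _ in range(T):
        mu_z, log_var_z = encoder(x_prev, w)                  # parameters of Q(z | x_{t-1}, w)
        z = mu_z + np.exp(0.5 * log_var_z) * rng.standard_normal(mu_z.shape)
        p = decoder(z, w)                                     # 88 Bernoulli probabilities
        x_t = (rng.random(p.shape) < p).astype(np.float32)    # sample the next frame
        frames.append(x_t)
        x_prev = x_t
    return np.stack(frames)
```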
Training details. We built all models using Keras with a TensorFlow backend. All weights were initialized to random samples from N(0, 0.01) and trained with stochastic gradient descent using Adam (Kingma & Ba, 2014). To prevent overfitting, we stopped training when the loss on our validation data did not improve for two consecutive epochs. We found that using weight normalization (Salimans & Kingma, 2016) resulted in faster convergence. We performed a hyperparameter search using grid search, choosing the hyperparameters that resulted in the lowest loss on validation data (see Tables 1, 2, and 3). For the VAE and Classifying VAE, a latent dimensionality of around 4-8 was preferred; for the vRNN and Classifying vRNN, a latent dimensionality of around 8-16 worked well, with the preferred batch sizes and sequence lengths reported in Tables 1-3. For all hidden perceptron layers (i.e., for the encoders, decoders, and classifiers) we found that a dimensionality equal to 88, the input dimensionality, worked best.

In both the VAE and vRNN models, the hyperparameter search found that the Classifying models generally needed fewer latent dimensions than the unsupervised models. We attribute this to the fact that the extra classifier network in the Classifying models adds predictive power and model complexity.

Experiment 1. The first question we ask is: how does the latent representation differ (or not) between the VAE and Classifying VAE? We first explore the differences between the two models when the dimensionality of the continuous latent variables z is 2 and the training data contains songs in either C major or C minor.

Figure 2. Comparison of the inferred values of z for the VAE and Classifying VAE from held-out test data with songs in the key of either C major (red points) or C minor (blue points). The Classifying VAE encodes z after first inferring the probability that X came from a song in one of the two keys. The Classifying VAE's encoded values of z appear to more closely match the imposed prior z ~ N(0, I) than do those of the VAE.

Table 1. Results of hyperparameter search (batch size, latent dimensionality, hidden units for the latent and classifier networks, and sequence length) for models trained on the 2-key version of JSB Chorales.

Table 2. Results of hyperparameter search for models trained on the 10-key version of JSB Chorales.

In this case, we can easily visualize the representation of the latent space learned by each network when encoding music in one of two keys. Additionally, because the VAE and Classifying VAE are not temporal models, the encoding and decoding of z_t depend only on the notes played at the current timestep, X_t, rather than on the sequence history.

Figure 3. Heatmap of decoded probabilities of X given different values of z and w, for a Classifying VAE trained on music in C major or C minor. For the same value of z, the value of w determines whether the decoded notes are appropriate for C major (when w = 0) or C minor (when w = 1).

There are several differences between the latent spaces of the VAE and Classifying VAE (Fig. 2). In the VAE, location in the latent space is important in determining what key the note/song is in. Notes in C minor (blue) have lower variance than notes in C major. Thus, a latent value near the center (e.g., [0, 0]) is more likely to be a note from C minor, while values near the edge (e.g., [0, 2]) are more likely to be notes from C major. Furthermore, samples/songs whose latents are near the center will likely change keys over time. In the Classifying VAE, the additional parameter w encodes the key of the song. This allows the representations of C major and C minor to be highly overlapping in the latent space. Furthermore, both latent distributions more closely resemble the prior distribution N(0, I). Indeed, since the Classifying VAE explicitly models the key as the class variable, the burden of clustering the keys (as in the VAE) is removed from the latent space. Fig. 3 shows that in the Classifying VAE, the same latent variable values z can give rise to very different notes and chords in the input space depending on the value of w.

Experiment 2. Having a better intuition of the latent space, we now move to evaluating and comparing the performance of the various models. We trained the VAE, Classifying VAE, vRNN, and Classifying vRNN on three different datasets: JSB Chorales with 2 keys, JSB Chorales with 18 keys (10 distinct key classes), and Piano-midi.de with 2 keys. We no longer constrained the latent dimensionality to be 2, but instead used the results of our hyperparameter search (Tables 1, 2, 3). First, we compared the average negative log-likelihood (i.e., reconstruction error) of each model on held-out validation data. Note that since the VAEs are auto-encoding and the vRNNs are predicting sequences, it is not necessarily fair to compare the two; the real comparison is between Classifying and non-classifying models. From the results in Table 4, it was unclear based on this metric alone whether the Classifying models improve performance: in some cases they were better (JSB 2 keys, Classifying VAE vs. VAE), sometimes worse, and in other cases comparable to the unsupervised model.
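For concreteness, one plausible way to compute this per-frame reconstruction negative log-likelihood for Bernoulli outputs is sketched below; the paper's exact evaluation protocol may differ.

```python
import numpy as np

def mean_negative_log_likelihood(x_true, p_pred, eps=1e-7):
    """Average negative log-likelihood of binary piano-roll frames x_true under
    independent Bernoulli probabilities p_pred (arrays of shape (T, 88))."""
    p = np.clip(p_pred, eps, 1.0 - eps)          # avoid log(0)
    nll = -(x_true * np.log(p) + (1.0 - x_true) * np.log(1.0 - p))
    return nll.sum(axis=-1).mean()               # sum over notes, average over frames
```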

Table 3. Results of hyperparameter search for models trained on the 2-key version of Piano-midi.de.

Table 4. Average negative log-likelihoods on test data (rows: JSB Chorales with 2 keys, JSB Chorales with 10 keys, Piano-midi.de with 2 keys; columns: VAE, Classifying VAE, vRNN, Classifying vRNN) for models fit to data transposed to C major or C minor in the JSB and Piano-midi.de datasets. Hyperparameters used are those in Tables 1 and 3.

We recognize that log-likelihood is not a perfect metric: for example, generative adversarial networks (GANs) have been gaining in popularity lately due to their ability to generate very realistic samples, yet their log-likelihoods are poor relative to models like VAEs. Thus, we turn to other performance metrics to compare these models.

One very important metric is how well a song stays in key. Indeed, the inspiration for this model was to generate sequences that do so better than the standard VAE and vRNN. We generated samples from each model after seeding the networks with musical sequences from held-out test data. Each model generated music for the next T = 32 timesteps, for N = 7000 different seed sequences from the test data. To measure how well these generated samples preserve the key of the seed sequence, we calculated the proportion of notes in the generated sequences that were in the key of the seed sequence. We observed that the seed sequences themselves ("Data" in Fig. 4) were not perfectly in key; this is presumably because the ground-truth key of the test data was measured per song and not per sequence. In any case, we found that the Classifying models all generated samples that were as in key as the test data itself. This is in contrast to the VAE and vRNN models, which generated samples whose notes were not consistently in key. Note that chance performance on this metric is 67%, because keys are composed of eight of the twelve pitch classes, so randomly selecting notes from all twelve pitch classes would result in 8 of 12 notes being in key. Furthermore, in experiments (not shown) where we flipped the value of w so that it represented the opposite key, we observed that the percentage of notes in key decreased drastically. This suggests that the value of w is in fact controlling how well generated notes stay in key.

Figure 4. Comparison of the percentage of notes from generated samples that are in key, after training on data in 18 different keys. The gray dotted line shows chance performance of 67%. Boxes depict the 25th, 50th, and 75th percentiles across N = 7000 samples. Each model generated a sample for T = 32 timesteps after being seeded with a musical sequence from held-out test data in one of eighteen different keys. The Classifying vRNN showed similar performance whether samples were generated by inferring the key of the seed (w) using its classifier network or by being given the true key. These percentages were similar to the percentage of notes in key in the test data ("Data").
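A small sketch of the in-key percentage metric described above, assuming the key of the seed is given as a set of pitch classes; the helper name and the MIDI offset convention are our own.

```python
import numpy as np

C_MAJOR_PCS = {0, 2, 4, 5, 7, 9, 11}   # pitch classes C D E F G A B

def fraction_in_key(piano_roll, key_pcs=C_MAJOR_PCS, lowest_midi=21):
    """Fraction of played notes whose pitch class lies in key_pcs.

    piano_roll is a (T, 88) binary array; piano key 0 is MIDI note 21 (A0)."""
    times, keys = np.nonzero(piano_roll)
    if keys.size == 0:
        return np.nan
    pitch_classes = (keys + lowest_midi) % 12
    return np.mean(np.isin(pitch_classes, list(key_pcs)))

roll = np.zeros((4, 88)); roll[0, 39] = 1   # piano key 39 is MIDI 60 (middle C)
print(fraction_in_key(roll))                # 1.0, since C is in the C major set
```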
Ideally the benefit of a consistent, selectable key would come without any penalty in music quality. We therefore next compared the samples from each model on other musicality metrics, to assess how the quality of the generated music compares to that of the original data set. In these comparisons, we generated samples from a set of 3450 seed sequences, all of which had been transposed to the key of C major.

We assessed three musicality metrics in addition to the key consistency described above. First, we assessed the number of tones played on average at each time point (indicated as SimulTones in Fig. 5). We found that the Classifying vRNN had a slightly higher number of tones per time step than the other models and the original data. Second, we measured the tone span, which is the average difference, across samples, between the highest and lowest tones played. We found that the Classifying vRNN was much more similar to the original data than the VAE or the vRNN. Finally, we measured polyphony, which is the percentage of time points per sample at which more than one note is played. Here we set a maximum of 5 notes per time step, since more than 5 notes per time step would be a significant departure from the training music (which has 4 tones at nearly all time steps). All models show less polyphony than the original music, with the Classifying vRNN having the lowest polyphony.
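Simple versions of these three metrics can be computed from a binary piano roll as sketched below; the exact definitions used for Figure 5 may differ in detail.

```python
import numpy as np

def musicality_metrics(piano_roll):
    """Compute (simultaneous tones, tone span, polyphony) for a (T, 88) binary roll."""
    notes_per_step = piano_roll.sum(axis=1)
    simul_tones = notes_per_step.mean()                  # avg notes per timestep
    played = np.nonzero(piano_roll.any(axis=0))[0]       # piano keys used at least once
    tone_span = played.max() - played.min() if played.size else 0
    polyphony = np.mean(notes_per_step > 1)              # share of steps with >1 note
    return simul_tones, tone_span, polyphony
```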

In summary, the Classifying vRNN does better than the VAE and the vRNN at maintaining a key and at mimicking the tone span of the original music. However, the Classifying vRNN had a slightly higher tone span than the original and a lower polyphony.

Figure 5. Comparison of the VAE, vRNN, and Classifying vRNN (cvRNN) on three musicality metrics. SimulTones indicates the number of tones played at each time step. ToneSpan indicates the average distance, across samples, between the highest tone and the lowest tone. Polyphony indicates the percentage of time points with more than one tone played. Note that each bar group is normalized by its maximum value to allow comparison on the same axes.

Experiment 3. As another validation of our proposed model, we trained a Classifying VAE on MNIST data and show that it can perform style and content separation (similar to Fig. 3) by using continuous latent variables z along with class labels w, where each class is a different digit. The implementation details are the same as before, except that for our classifier network we use a multilayer perceptron with two hidden layers containing 512 units each, with 20% dropout. Figure 6 shows examples of digits generated by the model both when holding w fixed while varying z (top), and when holding z fixed while varying w (bottom).

Figure 6. Using a Classifying VAE to separate style and content for MNIST. Top: Decoded digits when varying the continuous latents z while keeping the class labels w fixed, for two examples. Bottom: The left column shows an example digit from the held-out test data, and the numbers to the right show the model's generated analogies to this digit, found by using the same values of z but varying the class label w.

5. Conclusion

Kingma et al. (2014) proposed a generative semi-supervised model that is related to our work. In their generative process, the continuous latents z have a multivariate normal prior, class variables y are generated from a multinomial (and treated as latent variables if the data is unlabeled), and the data generating process is a function of y, z, and neural network parameters. Our proposed model differs in a few ways. First, their model assumes marginal independence between y and z given the data x (i.e., q(y, z | x) = q(y | x) q(z | x)). Our factorization differs in that we do not assume independence: q(y, z | x) = q(z | y, x) q(y | x). Even without this assumption, our model is able to recover a representation that disentangles style and content. Second, in our inference network for q(y | x), we parameterize q(y | x) as a Logistic Normal distribution. Since the Logistic Normal distribution is continuous, we can implement stochastic backpropagation (SGVB) through our classifier network in addition to our network for q(z | x, y).
This enables training both the classifier and the autoencoder simultaneously, in contrast to the model of Kingma et al. (2014), which must train these layers separately. Additionally, we do not consider training in the semi-supervised case of partially labeled data.

Another architecture applied to problems similar to those addressed by VAEs is the generative adversarial network (GAN) (Goodfellow et al., 2014). Two GAN architectures are particularly relevant to the model proposed in this paper.

The first is the Adversarial Autoencoder (Makhzani et al., 2015). This model introduces an adversarial network at the output of the latent variable layer of an autoencoder. The role of the adversarial layer is to force the encoder to map input data to latents such that the distribution of the mapped data matches the prior. This model has two advantages over the VAE. First, it does not require a reparameterizable prior, and can therefore employ any arbitrary prior as long as it can be sampled from. Second, the model generally produces representations in the latent space that are closer to the imposed prior. This is especially true for multi-modal distributions (e.g., mixtures of Gaussians), which could plausibly be used to generate data from specific classes. In addition, adversarial autoencoders can incorporate labels in the hidden layer (similar to the w latent variable used here), allowing for generation of data with specific class labels. This architecture has only been constructed for static data, and a sequential model has not yet been developed. Critically, the training of GANs is non-trivial and therefore less straightforward than for the VAE.

A second relevant GAN architecture is the continuous recurrent neural network GAN (C-RNN-GAN) (Mogren, 2016). This model is essentially the adversarial equivalent of the vRNN model discussed earlier. It uses a uni-directional recurrent generative model and a bi-directional discriminator model. It has been applied successfully to music generation, although we leave comparisons between this model and the one proposed here for future work. We note that the C-RNN-GAN does not incorporate key labels and would therefore likely suffer from the same limitations on key stability as the vRNN when trained on samples that have not been transposed to a single key.

In summary, we developed an extension of the variational autoencoder that allows for specification of class labels in terms of categorical latent variables. We demonstrated that this method is effective in generating data instances from selected classes using two applications: one where the class label determines the key of the generated music, and a second where the class label determines the value of the MNIST digit. In the music generation task our method demonstrated superior stability of key over the sample sequence, with comparable performance on other non-key-related metrics of musicality. We showed that the Classifying VAE is able to independently encode style and content in both the music generation and MNIST datasets. An interesting future direction is to see whether the Classifying vRNN is similarly able to separate style and content in its temporal sequences. For example, given a song in a particular key, can we somehow vary the class label and transpose the song to a different key? Another potential benefit of this model would be to generate songs following a particular chord progression. This would involve changing the interpretation of the class label w to be the current chord being played (instead of the key of the entire song), which may lead to an effective method for generating melodies to play on top of a given chord progression.

References

Bayer, Justin and Osendorfer, Christian. Learning stochastic recurrent networks. arXiv preprint, 2014.
Boulanger-Lewandowski, Nicolas, Bengio, Yoshua, and Vincent, Pascal. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In ICML, 2012.

Chung, Junyoung, Kastner, Kyle, Dinh, Laurent, Goel, Kratarth, Courville, Aaron C., and Bengio, Yoshua. A recurrent latent variable model for sequential data. arXiv preprint, 2015.

Cuthbert, Michael Scott and Ariza, Christopher. music21: A toolkit for computer-aided musicology and symbolic music data. In International Society for Music Information Retrieval Conference (ISMIR), 2010.

Doersch, Carl. Tutorial on variational autoencoders. arXiv preprint, 2016.

Fraccaro, Marco, Sønderby, Søren Kaae, Paquet, Ulrich, and Winther, Ole. Sequential neural models with stochastic layers. In Advances in Neural Information Processing Systems, 2016.

Goodfellow, Ian J., Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron C., and Bengio, Yoshua. Generative adversarial nets. In NIPS, 2014.

Kingma, Diederik P. and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint, 2014.

Kingma, Diederik P. and Welling, Max. Auto-encoding variational Bayes. arXiv preprint, 2013.

Kingma, Diederik P., Mohamed, Shakir, Rezende, Danilo Jimenez, and Welling, Max. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, 2014.

Krumhansl, Carol L. Cognitive Foundations of Musical Pitch. Oxford University Press, 2001.

Makhzani, Alireza, Shlens, Jonathon, Jaitly, Navdeep, and Goodfellow, Ian J. Adversarial autoencoders. arXiv preprint, 2015.

Mogren, Olof. C-RNN-GAN: Continuous recurrent neural networks with adversarial training. arXiv preprint, 2016.

Salimans, Tim and Kingma, Diederik P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. arXiv preprint, 2016.


More information

arxiv: v2 [eess.as] 24 Nov 2017

arxiv: v2 [eess.as] 24 Nov 2017 MuseGAN: Multi-track Sequential Generative Adversarial Networks for Symbolic Music Generation and Accompaniment Hao-Wen Dong, 1 Wen-Yi Hsiao, 1,2 Li-Chia Yang, 1 Yi-Hsuan Yang 1 1 Research Center for Information

More information

arxiv: v1 [cs.sd] 17 Dec 2018

arxiv: v1 [cs.sd] 17 Dec 2018 Learning to Generate Music with BachProp Florian Colombo School of Computer Science and School of Life Sciences École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland florian.colombo@epfl.ch arxiv:1812.06669v1

More information

Reconstruction of Ca 2+ dynamics from low frame rate Ca 2+ imaging data CS229 final project. Submitted by: Limor Bursztyn

Reconstruction of Ca 2+ dynamics from low frame rate Ca 2+ imaging data CS229 final project. Submitted by: Limor Bursztyn Reconstruction of Ca 2+ dynamics from low frame rate Ca 2+ imaging data CS229 final project. Submitted by: Limor Bursztyn Introduction Active neurons communicate by action potential firing (spikes), accompanied

More information

Gaussian Mixture Model for Singing Voice Separation from Stereophonic Music

Gaussian Mixture Model for Singing Voice Separation from Stereophonic Music Gaussian Mixture Model for Singing Voice Separation from Stereophonic Music Mine Kim, Seungkwon Beack, Keunwoo Choi, and Kyeongok Kang Realistic Acoustics Research Team, Electronics and Telecommunications

More information

Notes on David Temperley s What s Key for Key? The Krumhansl-Schmuckler Key-Finding Algorithm Reconsidered By Carley Tanoue

Notes on David Temperley s What s Key for Key? The Krumhansl-Schmuckler Key-Finding Algorithm Reconsidered By Carley Tanoue Notes on David Temperley s What s Key for Key? The Krumhansl-Schmuckler Key-Finding Algorithm Reconsidered By Carley Tanoue I. Intro A. Key is an essential aspect of Western music. 1. Key provides the

More information

Chord Label Personalization through Deep Learning of Integrated Harmonic Interval-based Representations

Chord Label Personalization through Deep Learning of Integrated Harmonic Interval-based Representations Chord Label Personalization through Deep Learning of Integrated Harmonic Interval-based Representations Hendrik Vincent Koops 1, W. Bas de Haas 2, Jeroen Bransen 2, and Anja Volk 1 arxiv:1706.09552v1 [cs.sd]

More information

WHAT MAKES FOR A HIT POP SONG? WHAT MAKES FOR A POP SONG?

WHAT MAKES FOR A HIT POP SONG? WHAT MAKES FOR A POP SONG? WHAT MAKES FOR A HIT POP SONG? WHAT MAKES FOR A POP SONG? NICHOLAS BORG AND GEORGE HOKKANEN Abstract. The possibility of a hit song prediction algorithm is both academically interesting and industry motivated.

More information

Analysis and Clustering of Musical Compositions using Melody-based Features

Analysis and Clustering of Musical Compositions using Melody-based Features Analysis and Clustering of Musical Compositions using Melody-based Features Isaac Caswell Erika Ji December 13, 2013 Abstract This paper demonstrates that melodic structure fundamentally differentiates

More information

Audio Cover Song Identification using Convolutional Neural Network

Audio Cover Song Identification using Convolutional Neural Network Audio Cover Song Identification using Convolutional Neural Network Sungkyun Chang 1,4, Juheon Lee 2,4, Sang Keun Choe 3,4 and Kyogu Lee 1,4 Music and Audio Research Group 1, College of Liberal Studies

More information

arxiv: v1 [cs.cv] 16 Jul 2017

arxiv: v1 [cs.cv] 16 Jul 2017 OPTICAL MUSIC RECOGNITION WITH CONVOLUTIONAL SEQUENCE-TO-SEQUENCE MODELS Eelco van der Wel University of Amsterdam eelcovdw@gmail.com Karen Ullrich University of Amsterdam karen.ullrich@uva.nl arxiv:1707.04877v1

More information

Laughbot: Detecting Humor in Spoken Language with Language and Audio Cues

Laughbot: Detecting Humor in Spoken Language with Language and Audio Cues Laughbot: Detecting Humor in Spoken Language with Language and Audio Cues Kate Park, Annie Hu, Natalie Muenster Email: katepark@stanford.edu, anniehu@stanford.edu, ncm000@stanford.edu Abstract We propose

More information

Learning Musical Structure Directly from Sequences of Music

Learning Musical Structure Directly from Sequences of Music Learning Musical Structure Directly from Sequences of Music Douglas Eck and Jasmin Lapalme Dept. IRO, Université de Montréal C.P. 6128, Montreal, Qc, H3C 3J7, Canada Technical Report 1300 Abstract This

More information

Evaluating Melodic Encodings for Use in Cover Song Identification

Evaluating Melodic Encodings for Use in Cover Song Identification Evaluating Melodic Encodings for Use in Cover Song Identification David D. Wickland wickland@uoguelph.ca David A. Calvert dcalvert@uoguelph.ca James Harley jharley@uoguelph.ca ABSTRACT Cover song identification

More information

Predicting Similar Songs Using Musical Structure Armin Namavari, Blake Howell, Gene Lewis

Predicting Similar Songs Using Musical Structure Armin Namavari, Blake Howell, Gene Lewis Predicting Similar Songs Using Musical Structure Armin Namavari, Blake Howell, Gene Lewis 1 Introduction In this work we propose a music genre classification method that directly analyzes the structure

More information

Computational Modelling of Harmony

Computational Modelling of Harmony Computational Modelling of Harmony Simon Dixon Centre for Digital Music, Queen Mary University of London, Mile End Rd, London E1 4NS, UK simon.dixon@elec.qmul.ac.uk http://www.elec.qmul.ac.uk/people/simond

More information

GENERATING NONTRIVIAL MELODIES FOR MUSIC AS A SERVICE

GENERATING NONTRIVIAL MELODIES FOR MUSIC AS A SERVICE GENERATING NONTRIVIAL MELODIES FOR MUSIC AS A SERVICE Yifei Teng U. of Illinois, Dept. of ECE teng9@illinois.edu Anny Zhao U. of Illinois, Dept. of ECE anzhao2@illinois.edu Camille Goudeseune U. of Illinois,

More information

Classification of Timbre Similarity

Classification of Timbre Similarity Classification of Timbre Similarity Corey Kereliuk McGill University March 15, 2007 1 / 16 1 Definition of Timbre What Timbre is Not What Timbre is A 2-dimensional Timbre Space 2 3 Considerations Common

More information

Week 14 Music Understanding and Classification

Week 14 Music Understanding and Classification Week 14 Music Understanding and Classification Roger B. Dannenberg Professor of Computer Science, Music & Art Overview n Music Style Classification n What s a classifier? n Naïve Bayesian Classifiers n

More information

A combination of approaches to solve Task How Many Ratings? of the KDD CUP 2007

A combination of approaches to solve Task How Many Ratings? of the KDD CUP 2007 A combination of approaches to solve Tas How Many Ratings? of the KDD CUP 2007 Jorge Sueiras C/ Arequipa +34 9 382 45 54 orge.sueiras@neo-metrics.com Daniel Vélez C/ Arequipa +34 9 382 45 54 José Luis

More information

The reduction in the number of flip-flops in a sequential circuit is referred to as the state-reduction problem.

The reduction in the number of flip-flops in a sequential circuit is referred to as the state-reduction problem. State Reduction The reduction in the number of flip-flops in a sequential circuit is referred to as the state-reduction problem. State-reduction algorithms are concerned with procedures for reducing the

More information

/$ IEEE

/$ IEEE 564 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 18, NO. 3, MARCH 2010 Source/Filter Model for Unsupervised Main Melody Extraction From Polyphonic Audio Signals Jean-Louis Durrieu,

More information

AUTOREGRESSIVE MFCC MODELS FOR GENRE CLASSIFICATION IMPROVED BY HARMONIC-PERCUSSION SEPARATION

AUTOREGRESSIVE MFCC MODELS FOR GENRE CLASSIFICATION IMPROVED BY HARMONIC-PERCUSSION SEPARATION AUTOREGRESSIVE MFCC MODELS FOR GENRE CLASSIFICATION IMPROVED BY HARMONIC-PERCUSSION SEPARATION Halfdan Rump, Shigeki Miyabe, Emiru Tsunoo, Nobukata Ono, Shigeki Sagama The University of Tokyo, Graduate

More information

Singer Recognition and Modeling Singer Error

Singer Recognition and Modeling Singer Error Singer Recognition and Modeling Singer Error Johan Ismael Stanford University jismael@stanford.edu Nicholas McGee Stanford University ndmcgee@stanford.edu 1. Abstract We propose a system for recognizing

More information

Adaptive Key Frame Selection for Efficient Video Coding

Adaptive Key Frame Selection for Efficient Video Coding Adaptive Key Frame Selection for Efficient Video Coding Jaebum Jun, Sunyoung Lee, Zanming He, Myungjung Lee, and Euee S. Jang Digital Media Lab., Hanyang University 17 Haengdang-dong, Seongdong-gu, Seoul,

More information

Automatic Music Genre Classification

Automatic Music Genre Classification Automatic Music Genre Classification Nathan YongHoon Kwon, SUNY Binghamton Ingrid Tchakoua, Jackson State University Matthew Pietrosanu, University of Alberta Freya Fu, Colorado State University Yue Wang,

More information

A Bayesian Network for Real-Time Musical Accompaniment

A Bayesian Network for Real-Time Musical Accompaniment A Bayesian Network for Real-Time Musical Accompaniment Christopher Raphael Department of Mathematics and Statistics, University of Massachusetts at Amherst, Amherst, MA 01003-4515, raphael~math.umass.edu

More information