Sequence generation and classification with VAEs and RNNs


Jay Hennig*, Akash Umakantha*, Ryan Williamson*
*Equal contribution. Carnegie Mellon University, Pittsburgh, PA. Correspondence to: Jay Hennig <jhennig@andrew.cmu.edu>, Akash Umakantha <aumakant@andrew.cmu.edu>, Ryan Williamson <rcw1@andrew.cmu.edu>.
Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 2017. JMLR: W&CP. Copyright 2017 by the author(s).

1. Introduction

Variational autoencoders (VAEs) (Kingma & Welling, 2013) are a popular approach to unsupervised learning that can also be used as generative models. VAEs model the data distribution as a nonlinear transformation of unobserved latent variables. Inference of the latent variables given the observed data is performed by a recognition (encoding) model, while the transformation from latent variables to data is modeled by a generating (decoding) model. During training, the distribution of latent variables is encouraged to take a simple form, such as a normal distribution, making the data-generating process straightforward. This approach has been shown to be successful at generating complicated forms of data such as digits and faces (Kingma & Welling, 2013).

However, when the data are multi-modal, VAEs do not provide an explicit mechanism for specifying which mode a generated sample should come from. For example, we may wish to ensure that we generate a 2 and not a 4. One possible approach is to model the latent distribution as a mixture distribution, such as a mixture of Gaussians (Doersch, 2016). However, this approach is difficult to train and often not successful in practice (Makhzani et al., 2015). In this paper we introduce the Classifying VAE, in which we train a classifier to detect the mode of each data point simultaneously with training the recognition and generation models. By modeling the probability that a data point belongs to each mode with a Logistic Normal distribution, we can use the reparameterization trick of Stochastic Gradient Variational Bayes to sample from this distribution during training.

For generating sequence data, VAEs can be successfully combined with recurrent neural networks (RNNs). The combination of VAEs and RNNs forms a class of models we refer to broadly as variational RNNs (vrnns) (Chung et al., 2015; Boulanger-Lewandowski et al., 2012; Fraccaro et al., 2016; Bayer & Osendorfer, 2014). When each sequence comes from a discrete class, we can similarly train a Classifying vrnn.

We demonstrate that our models are useful for generating data from one of several discrete modes by training them on polyphonic music data. In music, a segment conforms to a particular key, with different notes being appropriate in different keys. When generating music, therefore, it is important to be able to generate music from only one key at a time. Previous approaches often preprocess the training data to be in only one of two keys (Boulanger-Lewandowski et al., 2012), compared to the up to 24 keys present in the original music. Even in this simplified setting, we show that models such as VAEs and vrnns are prone to generating samples that do not stay in key, resulting in dissonance and poor-sounding music. By contrast, we show that our Classifying VAE and Classifying vrnn can generate samples that are as in key as the original training data, even when the training data include songs in 18 different keys.
2. Background & Related Work

Modeling sequence data. The standard application of RNNs to sequence data is to have them model the probability distribution over sequences as follows (Bayer & Osendorfer, 2014):

    p(x_{1:T}) = \prod_{t=1}^{T} p(x_t \mid x_{1:t-1})

The output of the RNN at each timestep t is then interpreted as the parameters of the term p(x_t | x_{1:t-1}). Boulanger-Lewandowski et al. (2012) combined an RNN with a restricted Boltzmann machine (RBM) in order to model correlations among the outputs at each timestep. Their paper established a baseline for music generation by reporting the performance of a variety of simple models such as GMMs, RBMs, HMMs, and n-grams. More recently, there has been considerable interest in the variational Bayes approach to training deep generative models.
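To make the factorization above concrete, here is a minimal sketch (our illustration, not the authors' code) of such an RNN sequence model in Keras: an LSTM reads a piano-roll prefix, and its per-timestep output is treated as 88 independent Bernoulli probabilities, trained with binary cross-entropy. The toy data, shapes, and layer sizes are placeholders.

    import numpy as np
    from tensorflow import keras
    from tensorflow.keras import layers

    # Toy piano-roll data: (sequences, timesteps, 88 notes), binary.
    rolls = (np.random.rand(32, 16, 88) > 0.9).astype("float32")
    x, y = rolls[:, :-1, :], rolls[:, 1:, :]          # predict x_t from x_{1:t-1}

    inp = keras.Input(shape=(None, 88))
    h = layers.LSTM(88, return_sequences=True)(inp)   # summarizes the history x_{1:t-1}
    # 88 independent Bernoulli probabilities per timestep: the parameters of p(x_t | x_{1:t-1})
    out = layers.TimeDistributed(layers.Dense(88, activation="sigmoid"))(h)

    model = keras.Model(inp, out)
    model.compile(optimizer="adam", loss="binary_crossentropy")
    model.fit(x, y, epochs=2, batch_size=8, verbose=0)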

The original paper by Kingma & Welling (2013), though not related to modeling sequence data, introduced the Stochastic Gradient Variational Bayes (SGVB) estimator and the variational autoencoder (VAE). The VAE uses a variational approximation to learn an encoder model, which maps input data to a stochastic latent representation, and a decoder model, which maps this latent representation back into the data space. Mathematically, the model seeks to maximize the following estimate of the variational lower bound:

    L(\theta, \phi; x^{(i)}) = -D_{KL}\!\left(q_\phi(z \mid x^{(i)}) \,\|\, p_\theta(z)\right) + E_{q_\phi(z \mid x^{(i)})}\!\left[\log p_\theta(x^{(i)} \mid z)\right]

Their key insight was what they called the reparametrization trick. The problem with classical backpropagation in a deep generative model is that it requires calculating gradients of an expectation with respect to a random variable z, which is difficult in general. The reparametrization trick rewrites the random variable as a deterministic function of some other random variable \epsilon. The objective is then rewritten as the empirical lower bound:

    \tilde{L}(\theta, \phi; x^{(i)}) = -D_{KL}\!\left(q_\phi(z \mid x^{(i)}) \,\|\, p_\theta(z)\right) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta(x^{(i)} \mid z^{(i,l)})

where z^{(i,l)} = g_\phi(\epsilon^{(i,l)}, x^{(i)}) and \epsilon^{(i,l)} \sim p(\epsilon). When z \sim N(\mu(x), \sigma^2(x)), the KL term has the closed form

    -D_{KL}\!\left(q_\phi(z \mid x^{(i)}) \,\|\, p_\theta(z)\right) = \frac{1}{2} \sum_{j=1}^{J} \left(1 + \log \sigma_j^2(x) - (\mu_j(x))^2 - \sigma_j^2(x)\right)
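Both quantities are straightforward to implement; the numpy sketch below (illustrative only, with our own function names) draws z via the reparameterization trick and evaluates the diagonal-Gaussian KL term in closed form.

    import numpy as np

    def sample_z(mu, log_var, rng=np.random.default_rng(0)):
        # Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
        # so gradients can flow through mu and log_var.
        eps = rng.standard_normal(mu.shape)
        return mu + np.exp(0.5 * log_var) * eps

    def negative_kl_to_standard_normal(mu, log_var):
        # Closed-form -KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions.
        return 0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)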
Recent papers have applied this variational perspective to modeling sequence data. Bayer & Osendorfer (2014) developed a model called STORN, which introduces stochastic latent variables into an RNN and is fit with SGVB; the sampled latent variables are assumed to be independent over time. Figure 1 shows a diagram of the STORN model. Chung et al. (2015) then developed the Variational RNN (VRNN), which extends STORN so that the latent variables depend on the sequence history and therefore evolve over time. Finally, Fraccaro et al. (2016) attempted to merge the RNN and state-space model perspectives; their Stochastic RNN (SRNN) explicitly separates the stochastic and deterministic components of the model.

Figure 1. The generative models for the VAE (Kingma & Welling, 2013) and vrnn (Bayer & Osendorfer, 2014; Chung et al., 2015) in black, with the additions of the Classifying VAE and Classifying vrnn shown in blue. z are continuous latent variables, while w specifies the class of X (in the Classifying VAE) or the class of a sequence X_1, ..., X_T (in the Classifying vrnn).

Application to music. In Western music there are twelve possible pitch classes (A, Bb, B, C, Db, D, Eb, E, F, Gb, G, Ab), and most songs are written using only a subset of these. The key of a song (or of a part of a song) refers to its central tonic note (e.g., C, D), along with its mode (e.g., major or minor). For example, music written in the key of C major will tend to use only notes without sharps or flats (A, B, C, D, E, F, G), with C being the tonic note that most melodies end on. Music in the key of C minor, by contrast, has the same tonic note (C) but a different overall set of notes (C, D, Eb, F, G, Ab, Bb). In this paper we consider 24 possible keys, one for each of the twelve tonic notes in either the major or minor mode.

We model music data as a series of 88-dimensional binary vectors, X_t \in \{0, 1\}^{88}, where the j-th entry, X_t^j, represents whether or not the j-th key of an 88-key piano was played at time t. We refer to X = \{X_t : t = 1, ..., T\} as a music sequence of length T.
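As an illustration of this representation (not the authors' preprocessing code), the sketch below builds such a binary piano roll from a toy list of (timestep, MIDI pitch) note events, assuming column 0 corresponds to A0 (MIDI pitch 21) and column 87 to C8 (MIDI pitch 108).

    import numpy as np

    def to_piano_roll(note_events, T):
        """note_events: iterable of (timestep, midi_pitch) pairs; returns X with shape (T, 88)."""
        X = np.zeros((T, 88), dtype=np.int8)
        for t, pitch in note_events:
            X[t, pitch - 21] = 1      # MIDI 21 (A0) .. 108 (C8) -> columns 0 .. 87
        return X

    # A C-major chord (C4, E4, G4) held for the first two timesteps of a 4-step sequence.
    X = to_piano_roll([(0, 60), (0, 64), (0, 67), (1, 60), (1, 64), (1, 67)], T=4)
    print(X.shape, X.sum(axis=1))     # (4, 88), [3 3 0 0]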

If X is in the key of C major, from a probabilistic standpoint this means that P(X_t^j) ≈ 0 whenever the note referred to by X_t^j is not part of the key of C major. While some songs do occasionally change keys, we assume that the key is constant as long as T is not too large.

3. Methods

Classifying Variational Autoencoder. The joint distribution we consider is

    P(X, z, w) = P(X \mid z, w)\, P(z \mid w)\, P(w)

where X is the observed data, and z and w are unobserved latent variables. We suppose that w is a vector of dimension d, specifying the probabilities that X comes from one of d distinct classes.

We now follow the main idea of the variational autoencoder, which begins with constructing a variational lower bound on the log of the marginal likelihood

    P(X) = \int_w \int_z P(X \mid z, w)\, P(z \mid w)\, P(w)\, dz\, dw.

We can write the following equality:

    \log P(X) - D\!\left[Q(z, w \mid X) \,\|\, P(z, w \mid X)\right] = \mathcal{L}(X) = E_{z, w \sim Q}\!\left[\log P(X \mid z, w)\right] - D\!\left[Q(z, w \mid X) \,\|\, P(z, w)\right]

where D is the KL divergence (which is always non-negative) and \mathcal{L}(X) is the evidence lower bound, or ELBO. Applying the chain rule to the KL terms lets us rewrite the left-hand side as

    \log P(X) - D\!\left[Q(z \mid w, X) \,\|\, P(z \mid w, X)\right] - D\!\left[Q(w \mid X) \,\|\, P(w \mid X)\right]

We can similarly rewrite the right-hand side:

    E_{z, w \sim Q}\!\left[\log P(X \mid z, w)\right] - D\!\left[Q(z \mid w, X) \,\|\, P(z \mid w)\right] - D\!\left[Q(w \mid X) \,\|\, P(w)\right]

Above, P(z | w) and P(w) are priors, while Q is an arbitrary distribution. Critically, Q will be parameterized by a neural network, and we will optimize Q (with gradient descent) so as to maximize the ELBO, thereby maximizing the marginal likelihood. To keep our lower bound on the marginal likelihood tight, we must ensure the KL divergences on the left-hand side are small. Typically, we do not have access to P(z | X, w). However, we suppose that during training we do have access to the true w. So instead of minimizing D[Q(w | X) || P(w | X)], we can minimize the cross-entropy between Q(w | X) and the true w. In other words, minimizing this portion of the loss encourages Q(w | X) to classify w. By simultaneously minimizing the cross-entropy loss while maximizing the ELBO, we keep the lower bound on the data likelihood tight while also maximizing it.

As with a standard VAE, we use the Stochastic Gradient Variational Bayes (SGVB) estimator of \mathcal{L}(X), whereby we draw samples of w ~ Q(w | X) and z ~ Q(z | w, X) for each minibatch during training. One limitation of the VAE is that the latent variables z and w cannot be discrete; otherwise we could not backpropagate through the parameters of the neural network that specifies Q. In our case, we want w to be the probabilities that X belongs to each of the d classes. We therefore parameterize Q(w | X) as a Logistic Normal distribution. Samples w ~ Q(w | X) = LN(\mu, \Sigma) have the property that 0 ≤ w_i ≤ 1 for i = 1, ..., d and \sum_{i=1}^{d} w_i = 1, so that w can be interpreted as our estimated probability that X belongs to each of the d classes.

Critically, to use SGVB we must be able to apply the reparameterization trick and generate samples of z and w as differentiable, deterministic functions of X and some auxiliary variables \epsilon with independent marginals. (It is important to note that samples from the Dirichlet, another option for Q(w | X), cannot be trivially written in this form.) A common choice for the distribution of z is a Normal distribution, to which the reparameterization trick applies. For w, samples from the Logistic Normal distribution can be generated deterministically from a sample from a Normal distribution with the same parameters. Specifically, if y ~ N(\mu, \Sigma) is a sample from a Normal distribution with y \in R^{d-1}, then w ~ LN(\mu, \Sigma) is a sample from a Logistic Normal if we set

    w_j = \frac{e^{y_j}}{1 + \sum_{j'=1}^{d-1} e^{y_{j'}}} \quad \text{for } j = 1, \dots, d-1, \qquad w_d = \frac{1}{1 + \sum_{j'=1}^{d-1} e^{y_{j'}}}.
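A minimal numpy sketch of this Logistic Normal reparameterization (our illustration; function and variable names are ours):

    import numpy as np

    def sample_logistic_normal(mu, log_var, rng=np.random.default_rng(0)):
        """Draw w ~ LN(mu, diag(sigma^2)) for mu, log_var of shape (d-1,)."""
        y = mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)  # y ~ N(mu, sigma^2)
        ey = np.exp(y)
        denom = 1.0 + ey.sum()
        return np.append(ey / denom, 1.0 / denom)   # lies on the simplex: entries sum to 1

    w = sample_logistic_normal(np.zeros(1), np.zeros(1))  # d = 2 classes, LN(0, I)
    print(w, w.sum())                                     # two probabilities summing to 1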
4. Experiments

Music generation. We apply our method to generate sequences of polyphonic music in the form of piano-roll data. For training data we use the entire corpus of 382 four-part harmonized chorales by J. S. Bach ("JSB Chorales"), obtained from the Python package music21 (Cuthbert & Ariza, 2010). This corpus was converted to piano-roll notation: a binary matrix denoting when each of 88 notes (from A0 to C8) was played over time, with time divided into quarter or eighth notes, depending on the data set.

One of the most popular algorithms for detecting the key of a musical sequence is the Krumhansl-Schmuckler algorithm (Krumhansl, 2001). This algorithm finds the proportion of each pitch class present in a sequence and compares this with the proportions expected in each key. We used the implementation of this algorithm in music21 to establish the ground-truth key of each musical sequence in our training data. The JSB Chorales corpus includes songs in 18 different keys. When classifying the key, we treat major and minor scales that share the same notes (e.g., C major and A minor) as the same key, resulting in 10 distinct key classes. Additionally, for comparison with previous work on this data set, we also present results for models trained on songs transposed so that each song is in either C major or C minor. Finally, we perform further experiments on the Piano-midi.de data set, with songs transposed to either C major or C minor, as in Boulanger-Lewandowski et al. (2012).
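For reference, music21 exposes a Krumhansl-style key finder directly through a stream's analyze('key') method; a small sketch, using a chorale from music21's bundled corpus purely as an example, might look like this:

    from music21 import corpus

    # Parse one Bach chorale from music21's bundled corpus and estimate its key.
    score = corpus.parse('bach/bwv66.6')
    key = score.analyze('key')          # key-profile matching in the Krumhansl tradition
    print(key.tonic.name, key.mode)     # e.g., "F#" "minor" for this chorale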

Implementation of the VAE and Classifying VAE. To model temporal data with a VAE or Classifying VAE, we considered two options. First, we may consider each timestep in isolation, autoencoding X_t to X_t via z_t and w_t. (Note that in all models, w_t is constant for all t belonging to the same song.) Our second option is to autoencode X_t to X_{t+1}. We found that the former approach led to samples with better musical qualities and lower cross-validated reconstruction error.

For the Classifying VAE, we assume the following priors on z and w:

    P(z \mid w) = P(z) = N(0, I)
    P(w) = LN(0, I)

For Q, we use:

    Q(w \mid X) = LN\!\left(\mu_w(X), \mathrm{diag}[\sigma_w^2(X)]\right)
    Q(z \mid X, w) = N\!\left(\mu_z(X, w), \mathrm{diag}[\sigma_z^2(X, w)]\right)

where \mu_w and \sigma_w^2 are implemented as the outputs of one neural network (the "classifier"), and \mu_z and \sigma_z^2 as the outputs of another (the "encoder"). Note that the outputs of the encoder are a function of X as well as w, so during training we must draw a sample w ~ Q(w | X) as described above before computing \mu_z(X, w) and \sigma_z^2(X, w). The true w is used during training only in the loss function. Finally, we assume P(X | w, z) is an 88-dimensional Bernoulli distribution whose probabilities p(w, z) are the output of a third neural network (the "decoder"). For the classifier, encoder, and decoder we use multilayer perceptrons (one for each network) with one hidden layer and ReLU activations. In practice, we found that classifying the key of a musical sequence is not difficult, even when the training data include songs in 18 different keys. To implement the standard VAE, we ignore w in all equations above and use only an encoder and decoder network, as in Kingma & Welling (2013).

Implementation of the vrnn and Classifying vrnn. For our standard vrnn, we implement a network similar to STORN (Bayer & Osendorfer, 2014). In this case, the encoder and decoder networks are each replaced with an LSTM with ReLU activations, followed by a single dense layer. The Classifying vrnn is similar except for the addition of a classifier, implemented as in the Classifying VAE. Critically, we classify the key using an entire sequence of X; we leave the sequence length as a hyperparameter to be chosen. A sample w ~ Q(w | X) is then fed into the encoder and decoder networks at each timestep of the sequence.

Generating music samples. When generating samples with a VAE, one would typically draw z_t ~ N(0, I), pass this value through the decoder to get p(z_t), and then sample X_t from those probabilities. However, we found that the generated samples had better musical qualities if we instead drew z_t ~ Q(z | X_{t-1}); in other words, we sample z_t from the encoding of X_{t-1}, the previously generated sample. We used this approach when generating samples for all models.

Training details. We built all models using Keras with a Tensorflow backend. All weights were initialized to random samples from N(0, 0.01) and trained with stochastic gradient descent using Adam (Kingma & Ba, 2014). To prevent overfitting, we stopped training when the loss on our validation data did not improve for two consecutive epochs. We found that using weight normalization (Salimans & Kingma, 2016) resulted in faster convergence. We performed a hyperparameter search using grid search, choosing the hyperparameters that resulted in the lowest loss on validation data (see Tables 1, 2, and 3). For the VAE and Classifying VAE, batch sizes in the range of 10-50 were preferred, along with a latent dimensionality of around 4-8. For the vrnn and Classifying vrnn, a batch size of 50-200 was sufficient, along with a latent dimensionality of around 8-16 and a sequence length of around 8-16. For all hidden perceptron layers (i.e., for the encoders, decoders, and classifiers) we found that a dimensionality equal to 88, the input dimensionality, worked best.
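The Keras sketch below wires up the three networks described above (classifier, encoder, decoder) for the non-temporal Classifying VAE. It reflects our reading of the architecture rather than the authors' code: layer sizes are illustrative, and the training losses (reconstruction, KL terms, and the cross-entropy on w against the true key) are omitted for brevity.

    import tensorflow as tf
    from tensorflow import keras
    from tensorflow.keras import layers

    D, d, latent_dim, hidden = 88, 2, 8, 88     # input dim, classes, z dim, hidden units (illustrative)
    x = keras.Input(shape=(D,))

    # Classifier: q(w|X) as a Logistic Normal over d classes.
    h_c = layers.Dense(hidden, activation="relu")(x)
    mu_w, logvar_w = layers.Dense(d - 1)(h_c), layers.Dense(d - 1)(h_c)

    def logistic_normal_sample(args):
        mu, logvar = args
        y = mu + tf.exp(0.5 * logvar) * tf.random.normal(tf.shape(mu))
        ey = tf.exp(y)
        denom = 1.0 + tf.reduce_sum(ey, axis=-1, keepdims=True)
        return tf.concat([ey / denom, 1.0 / denom], axis=-1)   # lies on the simplex

    w = layers.Lambda(logistic_normal_sample)([mu_w, logvar_w])

    # Encoder: q(z|X, w) as a diagonal Gaussian, reparameterized.
    h_e = layers.Dense(hidden, activation="relu")(layers.Concatenate()([x, w]))
    mu_z, logvar_z = layers.Dense(latent_dim)(h_e), layers.Dense(latent_dim)(h_e)
    z = layers.Lambda(
        lambda a: a[0] + tf.exp(0.5 * a[1]) * tf.random.normal(tf.shape(a[0]))
    )([mu_z, logvar_z])

    # Decoder: p(X|z, w) as 88 independent Bernoulli probabilities.
    h_d = layers.Dense(hidden, activation="relu")(layers.Concatenate()([z, w]))
    x_hat = layers.Dense(D, activation="sigmoid")(h_d)

    classifying_vae = keras.Model(x, [x_hat, w])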
In both the VAE and vrnn models, the hyperparameter search found that the Classifying models generally needed fewer latent dimensions than the unsupervised models. We attribute this to the fact that the extra classifier network in the Classifying models adds predictive power and model complexity.

Experiment 1. The first question we ask is: how does the latent representation differ (or not) between the VAE and the Classifying VAE? We first explore the differences between the two models when the dimensionality of the continuous latent variables z is 2 and the training data contain songs in either C major or C minor. In this case, we can easily visualize the representation of the latent space learned by each network when encoding music in one of two keys. Additionally, because the VAE and Classifying VAE are not temporal models, the encoding and decoding of z_t depends only on the notes played at the current timestep, X_t, rather than on the sequence history.

Figure 2. Comparison of the inferred values of z for the VAE and Classifying VAE on held-out test data with songs in the key of either C major (red points) or C minor (blue points). The Classifying VAE encodes z after first inferring the probability that X came from a song in one of the two keys. The Classifying VAE's encoded values of z appear to match the imposed prior z ~ N(0, I) more closely than do those of the VAE.

  JSB, 2 keys        batch size   latent dim   hidden units, latent   hidden units, classifier   sequence length
  VAE                20           8            88                     88                         n/a
  Classifying VAE    5            4            88                     88                         n/a
  vrnn               50           16           88                     88                         16
  Classifying vrnn   50           8            88                     88                         16

Table 1. Results of hyperparameter search for models trained on the 2-key version of JSB Chorales.

  JSB, 10 keys       batch size   latent dim   hidden units, latent   hidden units, classifier   sequence length
  VAE                50           8            88                     88                         n/a
  Classifying VAE    10           8            88                     88                         n/a
  vrnn               50           16           88                     88                         8
  Classifying vrnn   100          4            88                     88                         16

Table 2. Results of hyperparameter search for models trained on the 10-key version of JSB Chorales.

  PianoMidi, 2 keys  batch size   latent dim   hidden units, latent   hidden units, classifier   sequence length
  VAE                50           8            88                     88                         n/a
  Classifying VAE    20           4            88                     88                         n/a
  vrnn               200          16           88                     88                         8
  Classifying vrnn   50           4            88                     88                         8

Table 3. Results of hyperparameter search for models trained on the 2-key version of PianoMidi.de.

There are several differences between the latent spaces of the VAE and the Classifying VAE (Fig. 2). In the VAE, location in the latent space is important in determining what key the note/song is in. Notes in C minor (blue) have lower variance than notes in C major. Thus, a latent value near the center (e.g., [0, 0]) is more likely to correspond to a note from C minor, while values near the edge (e.g., [0, 2]) are more likely to correspond to notes from C major. Furthermore, samples/songs whose latents lie near the center will likely change keys over time. In the Classifying VAE, the additional parameter w encodes the key of the song. This allows the representations of C major and C minor to overlap heavily in the latent space. Furthermore, both latent distributions more closely resemble the prior distribution N(0, I). Indeed, since the Classifying VAE explicitly models the key as a class variable, the burden of clustering the keys (as in the VAE) is removed from the latent space. Fig. 3 shows that in the Classifying VAE, the same latent value z can give rise to very different notes and chords in the input space depending on the value of w.

Figure 3. Heatmap of decoded probabilities of X given different values of z and w, for a Classifying VAE trained on music in C major or C minor. For the same value of z, the value of w determines whether the decoded notes are appropriate for C major (when w = 0) or C minor (when w = 1).

Experiment 2. Having built a better intuition of the latent space, we now evaluate and compare the performance of the models relative to one another. We trained the VAE, Classifying VAE, vrnn, and Classifying vrnn on three datasets: JSB Chorales with 2 keys, JSB Chorales with 18 keys (10 distinct key classes), and Piano-midi.de with 2 keys. We no longer constrain the latent dimensionality to be 2, but instead use the results of our hyperparameter search (Tables 1, 2, 3). First, we compared the average negative log-likelihood (i.e., reconstruction error) of each model on held-out validation data. Note that because the VAEs are autoencoding while the vrnns are predicting sequences, it is not necessarily fair to compare the two; the real comparison is between Classifying and non-classifying models. From the results in Table 4, it was unclear from this metric alone whether the Classifying models improve performance: in some cases they were better (JSB 2 keys, Classifying VAE vs. VAE), in some cases worse, and in others comparable to the unsupervised models.

  Data set                  VAE     Classifying VAE   vrnn    Classifying vrnn
  JSB Chorales (2 keys)     5.32    4.38              5.35    6.065
  JSB Chorales (10 keys)    4.578   4.617             3.826   4.866
  Piano-midi.de (2 keys)    8.02    10.54             7.31    8.66

Table 4. Average negative log-likelihoods on test data for models fit to data transposed to C major or C minor in the JSB and PianoMidi datasets. Hyperparameters used are those in Table 1 and Table 3.
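Under the Bernoulli observation model, the average negative log-likelihood reported in Table 4 amounts to a per-frame binary cross-entropy; the sketch below is our reading of that metric (the paper does not spell out the exact normalization used).

    import numpy as np

    def avg_negative_log_likelihood(x, p, eps=1e-7):
        """x: binary piano-roll frames, p: predicted Bernoulli probabilities, both shaped (N, 88)."""
        p = np.clip(p, eps, 1.0 - eps)
        log_lik = x * np.log(p) + (1.0 - x) * np.log(1.0 - p)
        return -np.mean(np.sum(log_lik, axis=-1))   # mean over frames, sum over the 88 notes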

We recognize that log-likelihood is not a perfect metric: for example, generative adversarial networks (GANs) have been gaining popularity due to their ability to generate very realistic samples, yet their log-likelihoods are poor relative to other models such as VAEs. We therefore turn to other performance metrics to compare these models.

One very important metric is how well a song stays in key; indeed, the motivation for this model was to generate sequences that do so better than the standard VAE and vrnn. We generated samples from each model after seeding the networks with musical sequences from held-out test data. Each model generated music for the next T = 32 timesteps, for N = 7000 different seed sequences in the test data. To measure how well these generated samples preserve the key of the seed sequence, we calculated the proportion of notes in the generated sequences that were in the key of the seed sequence. We observed that the seed sequences themselves ("Data" in Fig. 4) were not perfectly in key, presumably because the ground-truth key of the test data was measured per song and not per sequence. In any case, we found that the Classifying models all generated samples that were as in key as the test data itself. This is in contrast to the VAE and vrnn models, which generated samples whose notes were not consistently in key. Note that chance performance on this metric is 67%, because keys are composed of eight of the twelve pitch classes, so randomly selecting notes from all twelve pitch classes would result in 8 of 12 notes being in key. Furthermore, in experiments (not shown) where we flipped the value of w so that it represented the opposite key, we observed that the percentage of notes in key decreased drastically. This suggests that the value of w is in fact controlling how well generated notes stay in key.

Figure 4. Comparison of the percentage of notes from generated samples that are in key, after training on data in 18 different keys. The gray dotted line shows chance performance of 67%. Boxes depict the 25th, 50th, and 75th percentiles across N = 7000 samples. Each model generated a sample for T = 32 timesteps after being seeded with a musical sequence from held-out test data in one of eighteen different keys. The Classifying vrnn showed similar performance when samples were generated by inferring the key of the seed (w) using its classifier network or when given the true key. These percentages were similar to the percentage of notes in key in the test data ("Data").
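The in-key percentage can be computed directly from a generated piano roll; the sketch below is our formulation (counting played notes whose pitch class lies in a given key's pitch-class set), since the paper's exact counting convention is not specified.

    import numpy as np

    C_MAJOR = {0, 2, 4, 5, 7, 9, 11}   # pitch classes C D E F G A B

    def fraction_in_key(piano_roll, key_pitch_classes=C_MAJOR):
        """piano_roll: (T, 88) binary array; returns the fraction of played notes in the key."""
        t_idx, note_idx = np.nonzero(piano_roll)
        pitch_classes = (note_idx + 21) % 12     # column j is MIDI pitch j + 21; C has pitch class 0
        in_key = np.isin(pitch_classes, list(key_pitch_classes))
        return in_key.mean() if in_key.size else 1.0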
Ideally, the benefit of a consistent, selectable key would come without any penalty in music quality. We therefore next compared the samples from each model on other musicality metrics to assess how the quality of the generated music compares to that of the original data set.

In making these comparisons, we generated samples from a set of 3450 seed sequences, all of which had been transposed to the key of C major. We assessed three musicality metrics in addition to the key consistency described above. First, we assessed the number of tones played on average at each time point (indicated as SimulTones in Fig. 5).

We found that the Classifying vrnn had a slightly higher number of tones per time step than the other models and the original data. We next measured the tone span, the average difference between the highest and lowest tones played across all samples; here the Classifying vrnn was much more similar to the original data than the VAE or the vrnn. Finally, we measured polyphony, the percentage of time points per sample at which more than one note is played. We set a maximum of 5 notes per time step, since more than 5 notes per time step would be a significant departure from the training music (which has 4 tones at nearly all time steps). All models show less polyphony than the original music, with the Classifying vrnn having the lowest polyphony. In summary, the Classifying vrnn does better than the VAE and the vrnn at maintaining a key and at mimicking the tone span of the original music. However, the Classifying vrnn had a slightly higher number of simultaneous tones than the original and a lower polyphony.

Figure 5. Comparison of VAE, vrnn, and Classifying vrnn (cvrnn) on three musicality metrics. SimulTone indicates the number of tones played at each time step. ToneSpan indicates the average distance between the highest and lowest tones across samples. Polyphony indicates the percentage of samples with more than 2 tones. Each bar group is normalized by its maximum value to allow comparison on the same axes.
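For completeness, the three musicality metrics can be computed from a piano roll roughly as follows (an illustrative reading of the definitions above, not the authors' evaluation code):

    import numpy as np

    def simultaneous_tones(piano_roll):
        """Average number of notes sounding per timestep; piano_roll is a (T, 88) binary array."""
        return float(piano_roll.sum(axis=1).mean())

    def tone_span(piano_roll):
        """Distance between the highest and lowest notes played anywhere in the sample."""
        notes = np.nonzero(piano_roll.any(axis=0))[0]
        return int(notes.max() - notes.min()) if notes.size else 0

    def polyphony(piano_roll):
        """Fraction of timesteps on which more than one note is played."""
        return float((piano_roll.sum(axis=1) > 1).mean())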
Experiment 3. As another validation of our proposed model, we trained a Classifying VAE on MNIST data and show that it can perform style and content separation (similar to Fig. 3) by using continuous latent variables z along with class labels w, where each class is a different digit. The implementation details are the same as before, except that for the classifier network we use a multilayer perceptron with two hidden layers of 512 units each and 20% dropout. Figure 6 shows examples of digits generated by the model both when holding w fixed while varying z (top) and when holding z fixed while varying w (bottom).

Figure 6. Using a Classifying VAE to separate style and content for MNIST. Top: decoded digits when varying the continuous latents z while keeping the class label w fixed, for two examples. Bottom: the left column shows an example digit from the held-out test data, and the numbers to the right show the model's generated analogies to this digit, found by using the same values of z but varying the class label w.

5. Conclusion

Kingma et al. (2014) proposed a generative semi-supervised model that is related to our work. In their generative process, the continuous latents z have a multivariate normal prior, class variables y are drawn from a multinomial (and treated as latent if the data are unlabelled), and the data-generating process is a function of y, z, and neural network parameters. Our proposed model differs in a few ways. First, their model assumes marginal independence between y and z given the data x (i.e., q(y, z | x) = q(y | x) q(z | x)). Our factorization does not assume independence: q(y, z | x) = q(z | y, x) q(y | x). Even without this assumption, our model is able to recover a representation that disentangles style and content. Second, in our inference network we parameterize q(y | x) as a Logistic Normal distribution. Since the Logistic Normal distribution is continuous, we can implement stochastic backpropagation (SGVB) through our classifier network in addition to our network for q(z | x, y). This enables training the classifier and autoencoder simultaneously, in contrast to the model of Kingma et al. (2014), which must train these components separately. Additionally, we do not consider training in the semi-supervised case of partially labeled data.

Another architecture applied to problems similar to those addressed by VAEs is the generative adversarial network (GAN) (Goodfellow et al., 2014).

Two GAN structures have been developed that are particularly relevant to the model proposed in this paper. The first is the Adversarial Autoencoder (Makhzani et al., 2015). This model introduces an adversarial network at the output of the latent-variable layer of an autoencoder. The role of the adversarial component is to force the encoder to map input data to latents whose distribution matches the prior. This model has two advantages over the VAE. First, it does not require a reparameterizable prior, and can therefore employ any prior that can be sampled from. Second, it generally produces latent representations that are closer to the imposed prior. This is especially true for multi-modal priors (e.g., mixtures of Gaussians), which could plausibly be used to generate data from specific classes. In addition, adversarial autoencoders can incorporate labels in the hidden layer (similar to the w latent variable used here), allowing for generation of data with specific class labels. This architecture has only been constructed for static data, and a sequential version has not yet been developed. Critically, the training of GANs is non-trivial and therefore less straightforward than for the VAE.

A second relevant GAN architecture is the continuous recurrent neural network GAN (C-RNN-GAN) (Mogren, 2016). This model is essentially the adversarial counterpart of the vrnn models discussed earlier: it uses a unidirectional recurrent generative model and a bidirectional recurrent discriminator. It has been applied successfully to music generation, although we leave comparisons between it and the model proposed here for future work. We note that the C-RNN-GAN does not incorporate key labels and would therefore likely suffer from the same limitations on key stability as the vrnn when trained on samples that have not been transposed to a single key.

In summary, we developed an extension of the variational autoencoder that allows for the specification of class labels in terms of categorical latent variables. We demonstrated that this method is effective at generating data instances from selected classes in two applications: one in which the class label determines the key of the generated music, and a second in which the class label determines the value of the MNIST digit. In the music generation task, our method demonstrated superior key stability over the sampled sequence, with comparable performance on other, non-key-related metrics of musicality. We also showed that the Classifying VAE is able to independently encode style and content in both the music generation and MNIST data sets. An interesting future direction is to ask whether the Classifying vrnn can similarly separate style and content in temporal sequences: for example, given a song in a particular key, can we vary the class label to transpose the song into a different key? Another potential extension is to generate songs that follow a particular chord progression. This would involve reinterpreting the class label w as the current chord being played (instead of the key of the entire song), which may lead to an effective method for generating melodies over a given chord progression.

References

Bayer, Justin and Osendorfer, Christian. Learning stochastic recurrent networks. arXiv preprint arXiv:1411.7610, 2014.
Boulanger-Lewandowski, Nicolas, Bengio, Yoshua, and Vincent, Pascal. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In ICML, 2012.

Chung, Junyoung, Kastner, Kyle, Dinh, Laurent, Goel, Kratarth, Courville, Aaron C., and Bengio, Yoshua. A recurrent latent variable model for sequential data. arXiv preprint arXiv:1506.02216, 2015.

Cuthbert, Michael Scott and Ariza, Christopher. music21: A toolkit for computer-aided musicology and symbolic music data. In International Society for Music Information Retrieval Conference (ISMIR), 2010.

Doersch, Carl. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908, 2016.

Fraccaro, Marco, Sønderby, Søren Kaae, Paquet, Ulrich, and Winther, Ole. Sequential neural models with stochastic layers. In Advances in Neural Information Processing Systems, pp. 2199-2207, 2016.

Goodfellow, Ian J., Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron C., and Bengio, Yoshua. Generative adversarial nets. In NIPS, 2014.

Kingma, Diederik P. and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, Diederik P. and Welling, Max. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Kingma, Diederik P., Mohamed, Shakir, Rezende, Danilo Jimenez, and Welling, Max. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pp. 3581-3589, 2014.

Krumhansl, Carol L. Cognitive foundations of musical pitch. Oxford University Press, 2001.

Makhzani, Alireza, Shlens, Jonathon, Jaitly, Navdeep, and Goodfellow, Ian J. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.

Mogren, Olof. C-RNN-GAN: Continuous recurrent neural networks with adversarial training. arXiv preprint arXiv:1611.09904, 2016.

Salimans, Tim and Kingma, Diederik P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. arXiv preprint arXiv:1602.07868, 2016.