Predicting the immediate future with Recurrent Neural Networks: Pre-training and Applications

Brandon Richardson
December 16, 2011

Introduction

Research performed over the last five years has shown that the process of training deep networks can be improved by carrying out the learning in smaller phases called pre-training. Rather than using backpropagation to train the whole network at once, we can train several small modules separately and then fine-tune the assembled network. This idea has only recently been applied to RNNs, in research here at Stanford. The current Stanford research divides training into three phases. First, the input-to-hidden-unit connections are trained using an auto-encoding objective. Second, the input-to-hidden-unit connections are held fixed and the temporal connections are trained to provide short-term memory. Third, the whole network is fine-tuned using the main objective function. The previous research shows that this pre-training yields a better reconstruction error than earlier methods, in some cases even outperforming the best hand-engineered feature selection. This project is being done in conjunction with Quoc Le's research here at Stanford. The main objective of this project is to apply the RNN pre-training algorithm to audio prediction.

The Algorithm Architecture:

The RNN algorithm is best described by referring to the figure below. This figure shows a 4-layer network, with [300 200 200 200] nodes per layer, where the input and output are composed of 1024 values. The vertical connections are autoencoders and the horizontal connections are the temporal connections. The nodes shown in green are only used during pre-training of the autoencoders.
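To make the three-phase schedule concrete, here is a minimal numpy sketch of one autoencoder layer plus its temporal connections. All details (the weight names W_enc, W_dec, W_temp, the tanh nonlinearity, the initialization, using a single hidden layer of 300 nodes) are illustrative assumptions on my part, not taken from the actual research code.

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_hid = 1024, 300                        # input size and first hidden layer

    W_enc = rng.normal(0, 0.01, (n_hid, n_in))     # vertical (autoencoder) encode weights
    W_dec = rng.normal(0, 0.01, (n_in, n_hid))     # vertical decode weights
    W_temp = rng.normal(0, 0.01, (n_hid, n_hid))   # horizontal (temporal) connections

    def encode(x):
        return np.tanh(W_enc @ x)

    def decode(h):
        return W_dec @ h

    def step(x_t, h_prev):
        # One recurrent step: the new input combined with the previous hidden state.
        return np.tanh(W_enc @ x_t + W_temp @ h_prev)

    # Phase 1: train W_enc and W_dec on a reconstruction objective,
    #          e.g. ||decode(encode(x)) - x||^2, ignoring W_temp.
    # Phase 2: hold W_enc and W_dec fixed and train W_temp so the hidden
    #          state carries short-term memory across time steps.
    # Phase 3: fine-tune all weights together on the main prediction objective.

A deeper network would stack additional encode/decode pairs between the input and the temporal layer; the sketch keeps a single layer for brevity.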

Initial Testing:

To verify the functionality of the RNN, and to reinforce my understanding of the algorithm's architecture, I began by training an RNN on the 36,000 video training examples Quoc had previously used to obtain results for his current research. That research attempted to predict the last 4 frames of a video from the previous 36 frames. The images shown below were generated using a 4-layer RNN with each interior layer containing 400 nodes. These parameters were chosen to closely match the research that had previously been completed using this frame-prediction algorithm and training data. The model took about a week and a half to train and was stopped before it fully converged.

The images shown below are the predictions made by this RNN on a test set I designed: a simple ball of light moving from the bottom-left corner of the image to the top-right corner. Top left shows the RNN's prediction of the 40th frame given the previous 36 frames. Top right shows the actual 40th frame of the video. Bottom left shows the actual 36th frame, the last frame given as input to the RNN. Bottom middle shows the auto-encoder output of the 36th frame. From these images it is easy to see that the RNN's prediction of the 40th frame closely resembles the 36th frame, the last frame given as input. The model therefore appears to capture the temporal effect poorly. Also interesting to note is that the auto-encoder output for the 36th frame scarcely resembles the actual 36th frame. This leads me to believe that the final objective-function optimization drastically changes the pre-trained autoencoder parameters, such that the decode parameters produce a proper prediction only when the temporal connections are used.
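For reference, the evaluation just described amounts to the following rollout, reusing the toy step() and decode() functions from the earlier sketch. Grouping the 36 input frames into 9 temporal steps of 4 stacked 16x16 frames (1024 values per step) follows the data layout described later in this report; the feedback of predictions as inputs is my assumption about how multi-step rollout would work.

    def predict_remaining(input_steps, n_predict_steps=1):
        # input_steps: the observed portion of the clip, e.g. 9 temporal
        # steps of 4 stacked 16x16 frames (1024 values per step).
        h = np.zeros(n_hid)
        for x_t in input_steps:
            h = step(x_t, h)          # run the RNN over the observed steps
        preds = []
        for _ in range(n_predict_steps):
            x_next = decode(h)        # decode the hidden state into a prediction
            preds.append(x_next)
            h = step(x_next, h)       # feed the prediction back as the next input
        return preds

Predicting frames 37-40 from frames 1-36 corresponds to a single predicted temporal step, i.e. n_predict_steps=1.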

Application of RNN to Audio:

After achieving reasonable results on the video data, I applied the algorithm to audio data. My goal was to predict the last second of a 10-second clip of music. The training and test data consist of 3000 audio loops obtained from the Apple program Soundtrack. The clips from Soundtrack were chosen because each loop is on average 10 seconds in length and repeats a particular beat. Conveniently, the clips are also designed to be added together to create a full recording. This enabled me to create 77,000 unique training examples by adding the tracks together in random combinations. I held 1000 of the original 3000 examples aside to be used as the test set.

Unfortunately, 10 seconds of data at 44100 samples/second is too much data to process efficiently with the algorithm; therefore, I compressed all the data to 1024 samples/second. This compression reduces the complexity of the data while keeping the overall beat; the difference is hardly noticeable when played back on normal speakers. Each audio loop is sampled at 16-bit resolution and then normalized for use with the algorithm (a sketch of this pipeline appears at the end of this section). The primary difference between the audio data and the video data is that each temporal step of the video data contains 4 frames, where each frame contains 16x16 samples from the same instant in time, while each temporal step of the audio data contains 1 second, or 1024 samples, of audio, each sample taken at a different instant in time.

Initially I attempted to fit the RNN model using only the 2000 training tracks supplied by Soundtrack. The plot below shows the average reconstruction error for two different networks fit to the training data. Oddly, the 2-layer, 400-nodes-per-layer RNN achieved a lower reconstruction error than the 3-layer RNN. When applied to the 1000-example test set, the average reconstruction error was nearly identical to the training set, implying a good fit. Unfortunately, when listening to the estimated data it is clear the model has converged to a locally optimal solution rather than the global solution, causing it to reproduce the final second of audio poorly. This occurs both for examples in the test set and for examples in the training set. The plots below show the estimated data and the actual data for one training-set example, probably the best one of the many examples I inspected by hand.
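Here is a minimal sketch of the preprocessing and mixing pipeline described above. The decimation method, the normalization constants, and the choice of two loops per mix are my assumptions; the report does not specify these details.

    def preprocess(loop_16bit, sr_in=44100, sr_out=1024):
        # Normalize 16-bit PCM to [-1, 1), then decimate roughly 43x
        # to get from 44100 samples/s down to 1024 samples/s.
        x = loop_16bit.astype(np.float64) / 32768.0
        return x[:: sr_in // sr_out]

    def make_mixes(loops, n_examples=77_000, seed=0):
        # Build new training examples by summing random combinations of loops,
        # mirroring how Soundtrack loops are designed to layer together.
        # Assumes all loops have been preprocessed to the same length.
        rng = np.random.default_rng(seed)
        mixes = []
        for _ in range(n_examples):
            i, j = rng.choice(len(loops), size=2, replace=False)
            mix = loops[i] + loops[j]
            mixes.append(mix / np.max(np.abs(mix)))   # re-normalize the sum
        return mixes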

Rerunning the 3-layer, 400-nodes-per-layer RNN model with the 77,000 training examples created from random combinations of the original 2000 Soundtrack training tracks resulted in a considerably better average reconstruction error, as seen in the plot below. Some of the audio samples the model still fails to predict entirely. However, after listening to a few of the samples the algorithm fits best and a few of the ones it fails to predict, it appears that the more times the beat repeats in the first 9 seconds, the better the model can predict the last second. The samples where the algorithm completely fails seem to correspond to samples where the beat never fully repeats within the 10-second interval. The two graphs below show the actual last second of data and the predicted last second of data for a test-set sample where the beat repeated a total of 2 times in the 10-second interval. The estimated last second is rather noisy, but it appears to accurately reproduce the subtlety of the beat. Filtering out the high-frequency noise and playing the clip back with the estimated last second produces a near match to the actual data.
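The report does not say which filter was used to remove the high-frequency noise; a simple moving-average low-pass over the 1024-sample prediction is one minimal possibility, sketched here purely for illustration.

    def lowpass(x, width=8):
        # Moving-average low-pass filter; at 1024 samples/s a width of 8
        # puts the first spectral null at 1024/8 = 128 Hz (an illustrative
        # choice, not a value from the report).
        kernel = np.ones(width) / width
        return np.convolve(x, kernel, mode="same")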

Shown below is this model's fit to the training-data example that the previous model failed to fit.

Conclusion:

This project demonstrated how an RNN with pre-training can be effective at predicting audio. The results show that an RNN can successfully predict the last second of audio data from the previous 9 seconds. However, it was also discovered that a large number of training examples is necessary to prevent the algorithm from converging to local minima. Further research is needed to test different numbers of layers and different node counts to find an optimal RNN for this type of audio prediction. The loops from Soundtrack contained audio from only around 500 different instruments. Therefore, it would be interesting to try predicting vocal audio clips, or even to test the algorithm's sensitivity to different types of instruments.

References:

Le, Quoc. "Predicting the immediate future with Recurrent Neural Networks: Pre-training and Applications," NIPS, 2011.