Supplementary material for Inverting Visual Representations with Convolutional Networks


Alexey Dosovitskiy, Thomas Brox
University of Freiburg, Freiburg im Breisgau, Germany
{dosovits,brox}@cs.uni-freiburg.de

Network architectures

Table 1 shows the architecture of AlexNet. Tables 2-6 show the architectures of the networks we used for inverting different features. Every fully connected and convolutional layer is followed by a leaky ReLU nonlinearity.

The networks for inverting HOG and LBP have two streams. Stream A compresses the input features spatially and accumulates information over large regions; we found this crucial for obtaining good estimates of the overall brightness of the image. Stream B does not compress spatially and hence can better preserve fine local details. At one point the outputs of the two streams are concatenated and processed jointly, denoted by J (a code sketch of this design is given at the end of this section). In the tables, K stands for kernel size and S for stride.

Shallow features details

As mentioned in the paper, for all three methods we use implementations from the VLFeat library [2] with the default settings. We use the Felzenszwalb et al. version of HOG with cell size 8. For SIFT we used 3 levels per octave; the first octave was 0 (corresponding to full resolution), and the number of octaves was set automatically, effectively searching for keypoints of all possible sizes.

The LBP version we used works with 3x3 pixel neighborhoods. Each of the 8 non-central bits equals one if the corresponding pixel is brighter than the central one. All 256 possible patterns are quantized into 58 patterns: 56 patterns with exactly one transition from 0 to 1 when going around the central pixel, plus one quantized pattern comprising the two uniform patterns, plus one quantized pattern containing all other patterns. The quantized LBP patterns are then grouped into local histograms over cells of 16x16 pixels (see the sketch below).

Experiments: shallow representations

Figure 1 shows several images and their reconstructions from HOG, SIFT and LBP. HOG allows for the best reconstruction, SIFT slightly worse, and LBP slightly worse still. Colors are often reconstructed correctly, but are sometimes wrong, for example in the last row. Interestingly, all networks typically agree on the estimated colors.

Figure 1: Inversion of shallow image representations (rows: input image; our reconstructions from HOG, SIFT, and LBP).
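The 256-to-58 quantization and cell pooling described above are compact enough to spell out in code. The following NumPy sketch is our illustration: the neighbor ordering and the numbering of the 58 bins are assumptions and need not match VLFeat's exact layout.

```python
import numpy as np

def transitions_0_to_1(pattern):
    """Count 0->1 transitions when going once around the 8-bit ring."""
    bits = [(pattern >> i) & 1 for i in range(8)]
    return sum(1 for i in range(8) if bits[i] == 0 and bits[(i + 1) % 8] == 1)

# Build the 256 -> 58 lookup table: 56 bins for patterns with exactly one
# 0->1 transition, one shared bin for the two uniform patterns (all zeros,
# all ones), and one catch-all bin for everything else.
lut = np.empty(256, dtype=np.int64)
next_bin = 0
for p in range(256):
    if p in (0, 255):                  # the two uniform patterns share bin 56
        lut[p] = 56
    elif transitions_0_to_1(p) == 1:   # 56 single-transition patterns
        lut[p] = next_bin
        next_bin += 1
    else:                              # all remaining patterns
        lut[p] = 57

def lbp_histograms(img, cell=16):
    """Quantized 8-neighbor LBP codes pooled into histograms over
    cell x cell regions; img is a 2-D grayscale array."""
    h, w = img.shape
    codes = np.zeros((h - 2, w - 2), dtype=np.int64)
    # offsets of the 8 neighbors, in a fixed circular order (our choice)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    center = img[1:-1, 1:-1]
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= (neighbor > center).astype(np.int64) << bit
    quantized = lut[codes]
    hists = []
    for y in range(0, codes.shape[0] - cell + 1, cell):
        row = []
        for x in range(0, codes.shape[1] - cell + 1, cell):
            block = quantized[y:y + cell, x:x + cell]
            row.append(np.bincount(block.ravel(), minlength=58))
        hists.append(row)
    return np.array(hists)  # shape (cells_y, cells_x, 58)
```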
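To make the two-stream design concrete before the layer tables, here is a minimal PyTorch sketch of the HOG inversion network of Table 2. This is our illustration, not the authors' implementation: the leaky-ReLU slope (0.2), the padding scheme, and the absence of an output nonlinearity are assumptions; only the layer names, kernels, strides, and tensor sizes come from the table.

```python
import torch
import torch.nn as nn

def conv(cin, cout, k, s):
    # "same"-style padding for odd kernels; leaky ReLU after every layer,
    # as stated above (slope 0.2 is our assumption)
    return nn.Sequential(nn.Conv2d(cin, cout, k, stride=s, padding=k // 2),
                         nn.LeakyReLU(0.2, inplace=True))

def upconv(cin, cout):
    # kernel 4, stride 2, padding 1 doubles the spatial resolution
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1),
                         nn.LeakyReLU(0.2, inplace=True))

class HOGInverter(nn.Module):
    """Two-stream decoder from Table 2: stream A compresses spatially,
    stream B keeps the 32x32 resolution, and the two are concatenated at J."""
    def __init__(self):
        super().__init__()
        self.stream_a = nn.Sequential(
            conv(31, 256, 5, 2),    # conva1: 32x32x31 -> 16x16x256
            conv(256, 512, 5, 2),   # conva2: -> 8x8x512
            conv(512, 1024, 3, 2),  # conva3: -> 4x4x1024
            upconv(1024, 512),      # upconva1: -> 8x8x512
            upconv(512, 256),       # upconva2: -> 16x16x256
            upconv(256, 128),       # upconva3: -> 32x32x128
        )
        self.stream_b = nn.Sequential(
            conv(31, 128, 5, 1),    # convb1: 32x32x31 -> 32x32x128
            conv(128, 128, 3, 1),   # convb2: -> 32x32x128
        )
        self.joint = nn.Sequential(
            conv(256, 256, 3, 1),   # convj1 on concat(A, B)
            conv(256, 128, 3, 1),   # convj2
            upconv(128, 64),        # upconvj4: -> 64x64x64
            upconv(64, 32),         # upconvj5: -> 128x128x32
            # upconvj6 to RGB; output nonlinearity omitted here
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),  # -> 256x256x3
        )

    def forward(self, hog):  # hog: (N, 31, 32, 32)
        j = torch.cat([self.stream_a(hog), self.stream_b(hog)], dim=1)
        return self.joint(j)  # (N, 3, 256, 256)
```

The LBP network of Table 4 follows the same pattern, with one fewer compression stage in stream A and a 16x16x58 input.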

layer   processing steps   out size   out channels
CONV1   conv1, relu1       55x55      96
        mpool1, norm1      27x27      96
CONV2   conv2, relu2       27x27      256
        mpool2, norm2      13x13      256
CONV3   conv3, relu3       13x13      384
CONV4   conv4, relu4       13x13      384
CONV5   conv5, relu5       13x13      256
        mpool5             6x6        256
FC6     fc6, relu6         1x1        4096
        drop6              1x1        4096
FC7     fc7, relu7         1x1        4096
        drop7              1x1        4096
FC8     fc8                1x1        1000

Table 1: Summary of the AlexNet network. Input image size is 227x227.

layer     input                input size   K  S  output size
conva1    HOG                  32x32x31     5  2  16x16x256
conva2    conva1               16x16x256    5  2  8x8x512
conva3    conva2               8x8x512      3  2  4x4x1024
upconva1  conva3               4x4x1024     4  2  8x8x512
upconva2  upconva1             8x8x512      4  2  16x16x256
upconva3  upconva2             16x16x256    4  2  32x32x128
convb1    HOG                  32x32x31     5  1  32x32x128
convb2    convb1               32x32x128    3  1  32x32x128
convj1    {upconva3, convb2}   32x32x256    3  1  32x32x256
convj2    convj1               32x32x256    3  1  32x32x128
upconvj4  convj2               32x32x128    4  2  64x64x64
upconvj5  upconvj4             64x64x64     4  2  128x128x32
upconvj6  upconvj5             128x128x32   4  2  256x256x3

Table 2: Network for reconstructing from HOG features.

layer    input     input size   K  S  output size
conv1    SIFT      64x64x133    5  2  32x32x256
conv2    conv1     32x32x256    3  2  16x16x512
conv3    conv2     16x16x512    3  2  8x8x1024
conv4    conv3     8x8x1024     3  2  4x4x2048
conv5    conv4     4x4x2048     3  1  4x4x2048
conv6    conv5     4x4x2048     3  1  4x4x1024
upconv1  conv6     4x4x1024     4  2  8x8x512
upconv2  upconv1   8x8x512      4  2  16x16x256
upconv3  upconv2   16x16x256    4  2  32x32x128
upconv4  upconv3   32x32x128    4  2  64x64x64
upconv5  upconv4   64x64x64     4  2  128x128x32
upconv6  upconv5   128x128x32   4  2  256x256x3

Table 3: Network for reconstructing from SIFT features.

layer     input                input size   K  S  output size
conva1    LBP                  16x16x58     5  2  8x8x256
conva2    conva1               8x8x256      5  2  4x4x512
conva3    conva2               4x4x512      3  1  4x4x1024
upconva1  conva3               4x4x1024     4  2  8x8x512
upconva2  upconva1             8x8x512      4  2  16x16x256
convb1    LBP                  16x16x58     5  1  16x16x128
convb2    convb1               16x16x128    3  1  16x16x128
convj1    {upconva2, convb2}   16x16x384    3  1  16x16x256
convj2    convj1               16x16x256    3  1  16x16x128
upconvj3  convj2               16x16x128    4  2  32x32x128
upconvj4  upconvj3             32x32x128    4  2  64x64x64
upconvj5  upconvj4             64x64x64     4  2  128x128x32
upconvj6  upconvj5             128x128x32   4  2  256x256x3

Table 4: Network for reconstructing from LBP features.

layer    input          input size   K  S  output size
conv1    AlexNet CONV5  6x6x256      3  1  6x6x256
conv2    conv1          6x6x256      3  1  6x6x256
conv3    conv2          6x6x256      3  1  6x6x256
upconv1  conv3          6x6x256      5  2  12x12x256
upconv2  upconv1        12x12x256    5  2  24x24x128
upconv3  upconv2        24x24x128    5  2  48x48x64
upconv4  upconv3        48x48x64     5  2  96x96x32
upconv5  upconv4        96x96x32     5  2  192x192x3

Table 5: Network for reconstructing from AlexNet CONV5 features.

layer    input        input size  K  S  output size
fc1      AlexNet FC8  1000        -  -  4096
fc2      fc1          4096        -  -  4096
fc3      fc2          4096        -  -  4096
reshape  fc3          4096        -  -  4x4x256
upconv1  reshape      4x4x256     5  2  8x8x256
upconv2  upconv1      8x8x256     5  2  16x16x128
upconv3  upconv2      16x16x128   5  2  32x32x64
upconv4  upconv3      32x32x64    5  2  64x64x32
upconv5  upconv4      64x64x32    5  2  128x128x3

Table 6: Network for reconstructing from AlexNet FC8 features.

Experiments: AlexNet

We show here several additional figures similar to the ones in the main paper. Reconstructions from different layers of AlexNet are shown in Figure 2. Figure 3 shows results illustrating the dark knowledge hypothesis, similar to Figure 8 of the main paper: we reconstruct from all features, from only the 5 largest activations, and from all activations except the 5 largest (a sketch of constructing these variants follows below). It turns out that the top 5 activations are not very important. Figure 4 shows images generated by activating single neurons in different layers and setting all other neurons to zero. Particularly interpretable are the images generated this way from fc8. Every neuron there corresponds to a class, so the image generated from the activation of, say, the "apple" neuron could be expected to be a stereotypical apple.
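The three inputs compared in Figure 3 are easy to construct. A minimal NumPy sketch (our illustration, not the authors' code; `feat` stands for an activation vector such as fc8):

```python
import numpy as np

def dark_knowledge_variants(feat, k=5):
    """Return the three inputs compared in Figure 3: all activations,
    only the k largest, and all but the k largest (zeroed out)."""
    top = np.argsort(feat)[-k:]      # indices of the k largest activations
    top_only = np.zeros_like(feat)
    top_only[top] = feat[top]
    no_top = feat.copy()
    no_top[top] = 0.0
    return feat, top_only, no_top
```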

What we observe looks rather like the average of all images of the class. For some classes the reconstructions are somewhat interpretable, for others less so.

A qualitative comparison of the reconstructions of our method with those of [1] and with AlexNet-based autoencoders is given in Figure 5. Reconstructions from feature vectors obtained by interpolating between the feature vectors of two images are shown in Figure 6, both for the fixed AlexNet and for autoencoder training. More examples of such interpolations with fixed AlexNet are shown in Figure 7.

As described in section 5.5 of the main paper, we tried two different distributions for sampling random feature activations: a histogram-based one and a truncated Gaussian. Figure 8 shows the results with the fixed AlexNet network and the truncated Gaussian distribution. Figures 9 and 10 show images generated with autoencoder-trained networks. Note that the images generated from autoencoders look much less realistic than the images generated with a network with fixed AlexNet weights. This indicates that reconstructing from AlexNet features requires a strong natural image prior. (A sketch of the two sampling schemes follows below.)
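This supplement does not spell out the sampling procedures, so the following NumPy sketch is only one plausible reading of the two schemes: a Gaussian truncated at zero, and per-dimension draws from the empirical (histogram) distribution of training features. The function names and the assumption that dimensions are sampled independently are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_truncated_gaussian(mean, std, dim):
    """Draw one feature vector from a Gaussian truncated at zero
    (negative draws are resampled), keeping activations non-negative
    like post-ReLU features."""
    x = rng.normal(mean, std, dim)
    while np.any(x < 0):
        neg = x < 0
        x[neg] = rng.normal(mean, std, neg.sum())
    return x

def sample_from_histograms(train_feats):
    """Draw one feature vector dimension-wise from the empirical
    distribution of each dimension over a set of training feature
    vectors of shape (n, d); dimensions are treated as independent."""
    n, d = train_feats.shape
    idx = rng.integers(0, n, size=d)
    return train_feats[idx, np.arange(d)]
```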

Figure 2: Reconstructions from different layers of AlexNet (columns: input image, then reconstructions from CONV1, CONV2, CONV3, CONV4 and higher layers).

Figure 3: Left to right: input image, reconstruction from fc8, reconstruction from the 5 largest activations in fc8, and reconstruction from all activations except the 5 largest.

Figure 4: Reconstructions from single neuron activations in the fully connected layers of AlexNet. The neurons correspond to classes; left to right: kite, convertible, desktop computer, school bus, street sign, soup bowl, bell pepper, soccer ball.

Figure 5: Reconstructions from different layers of AlexNet with our method ("Our"), the method of [1], and AlexNet-based autoencoders ("AE").

Figure 6: Interpolation between the features of two images. Left: AlexNet weights fixed; right: autoencoder.

Figure 7: More interpolations between the features of two images with fixed AlexNet weights.

Figure 8: Images generated from random feature vectors of top layers of AlexNet with the simpler truncated Gaussian distribution (see section 5.5 of the main paper).

Figure 9: Images generated from random feature vectors of top layers of AlexNet-based autoencoders with the histogram-based distribution (see section 5.5 of the main paper).

Figure 10: Images generated from random feature vectors of top layers of AlexNet-based autoencoders with the simpler truncated Gaussian distribution (see section 5.5 of the main paper).

References

[1] A. Mahendran and A. Vedaldi. Understanding deep image representations by inverting them. In CVPR, 2015.
[2] A. Vedaldi and B. Fulkerson. VLFeat: An open and portable library of computer vision algorithms. In Proceedings of the ACM International Conference on Multimedia, pages 1469-1472, 2010.