Scene Classification with Inception-7. Christian Szegedy with Julian Ibarz and Vincent Vanhoucke

Task: classification of images into 10 different classes: Bedroom, Bridge, Church Outdoor, Classroom, Conference Room, Dining Room, Kitchen, Living Room, Restaurant, Tower.

Training/validation/test split: ~9.87 million training images, 3 thousand validation images, 10 thousand test images.

Evolution of Inception: Inception 5 (GoogLeNet)[1] through Inception 7a.
[1] Going Deeper with Convolutions, C. Szegedy et al., CVPR 2015

Structural changes from Inception 5 to 6
In each Inception module (between the previous layer and the filter concatenation), the 5x5 convolution branch is replaced by two stacked 3x3 convolutions. A 5x5 convolution + ReLU becomes 3x3 convolution + ReLU followed by 3x3 convolution + ReLU:
- Each mini network has the same receptive field.
- Deeper: more expressive (ReLU on both layers).
- 25/18 times (~28%) cheaper (due to feature sharing).
- Computation savings can be used to increase the number of filters.
- Downside: needs more memory at training time.

Grid size reduction, Inception 5 vs 6
Inception 5 reduces grid size by applying stride-2 pooling after a full Inception module. In Inception 6 the reduction happens inside the module: the convolution branches run at stride 2 in parallel with stride-2 pooling, and the results are concatenated. Much cheaper!

Structural changes from Inception 6 to 7
In each Inception module, the 3x3 convolutions are replaced by a 3x1 convolution followed by a 1x3 convolution. A 3x3 convolution + ReLU becomes 3x1 convolution + ReLU followed by 1x3 convolution + ReLU:
- Each mini network has the same receptive field.
- Deeper: more expressive (ReLU on both layers).
- 9/6 times (~33%) cheaper (due to feature sharing).
- Computation savings can be used to increase the number of filters.
- Downside: needs more memory at training time.
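The cost ratios quoted for the two factorizations can be checked by counting weights per input/output-channel pair (assuming, as on the slides, that the stacked layers keep the same channel widths; `conv_cost` is a helper introduced here for illustration):

```python
# Cost (weights per input/output-channel pair) of a stack of
# convolutions, for comparing the Inception factorizations.

def conv_cost(kernel_sizes):
    """Sum of h*w over the stacked kernels."""
    return sum(h * w for h, w in kernel_sizes)

# Inception 5 -> 6: one 5x5 replaced by two 3x3 (same receptive field).
cost_5x5 = conv_cost([(5, 5)])              # 25
cost_two_3x3 = conv_cost([(3, 3), (3, 3)])  # 18
print(cost_5x5 / cost_two_3x3)  # 25/18, i.e. ~28% cheaper

# Inception 6 -> 7: one 3x3 replaced by 3x1 followed by 1x3.
cost_3x3 = conv_cost([(3, 3)])              # 9
cost_asym = conv_cost([(3, 1), (1, 3)])     # 6
print(cost_3x3 / cost_asym)  # 9/6 = 1.5, i.e. ~33% cheaper
```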

Inception-6 vs Inception-7: Padding
Inception 6 uses SAME padding throughout.

SAME padding (padding with zero values; output size is independent of patch size):
  Input grid  Patch size  Stride  Output grid
  8x8         3x3         1       8x8
  8x8         5x5         1       8x8
  8x8         3x3         2       4x4
  8x8         3x3         4       2x2

VALID padding (no padding: each patch is fully contained; output size depends on the patch size):
  Input grid  Patch size  Stride  Output grid
  7x7         3x3         1       5x5
  7x7         5x5         1       3x3
  7x7         3x3         2       3x3
  7x7         3x3         4       2x2

Inception-6 vs Inception-7: Padding
Advantages of each padding method:
SAME padding:
- More equal distribution of gradients
- Fewer boundary effects
- No tunnel vision (sensitivity drop at the border)
VALID padding:
- More refined: higher grid sizes at the same computational cost

  Stride  Inception 6 padding  Inception 7 padding
  1       SAME                 SAME (VALID on first few layers)
  2       SAME                 VALID

Inception-6 vs Inception-7: Padding
Grid sizes through the stride-2 layers:
Inception 6: 224, 112, 56, 28, 14, 7
Inception 7: 299, 147, 73, 71, 35, 17, 8
30% reduction of computation compared to a 299x299 network with SAME padding throughout.
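The grid-size progressions above follow from two formulas: SAME gives ceil(input/stride) regardless of patch size, while VALID gives floor((input - patch)/stride) + 1. A small sketch reproducing the slide's numbers; note the patch/stride assignments for the Inception 7 stem (one 7x7 stride-2 layer, then 3x3 layers) are inferred from the grid sizes, not stated on the slide:

```python
import math

def same_out(n, stride):
    """SAME padding: output size is independent of patch size."""
    return math.ceil(n / stride)

def valid_out(n, patch, stride):
    """VALID padding: every patch fully contained, no zero padding."""
    return (n - patch) // stride + 1

# Inception 6: SAME padding, stride-2 reductions from 224 down to 7.
sizes6 = [224]
while sizes6[-1] > 7:
    sizes6.append(same_out(sizes6[-1], 2))
print(sizes6)  # [224, 112, 56, 28, 14, 7]

# Inception 7: VALID padding; assumed layer shapes as (patch, stride).
sizes7 = [299]
for patch, stride in [(7, 2), (3, 2), (3, 1), (3, 2), (3, 2), (3, 2)]:
    sizes7.append(valid_out(sizes7[-1], patch, stride))
print(sizes7)  # [299, 147, 73, 71, 35, 17, 8]
```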

Spending the computational savings
  Grid size                      Inception 5 filters  Inception 6 filters  Inception 7 filters
  28x28 (35x35 for Inception 7)  256                  320                  288
  14x14 (17x17 for Inception 7)  528                  576                  1248
  7x7 (8x8 for Inception 7)      1024                 1024                 2048
Note: the figures denote the maximum number of filters per grid cell for each grid size. The typical number of filters is lower, especially for Inception 7.

LSUN-specific modification
[Diagram: the standard stem (299x299 input, 7x7 convolution with stride 2, down to 147x147 and then 73x73 via stride-2 convolutions and stride-2 max pooling) is shortened so that a smaller input reaches the 73x73 grid after a single 7x7 stride-2 convolution.] This accommodates low-resolution images and image patches.

Training
- Stochastic gradient descent
- Momentum (0.9)
- Fixed learning rate decay of 0.94
- Batch size: 32
- Random patches:
  - Minimum sample area: 15% of the full image
  - Minimum aspect ratio: 3:4 (affine distortion)
  - Random contrast, brightness, hue and saturation
- Batch normalization (Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, S. Ioffe, C. Szegedy, ICML 2015)
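A minimal sketch of the optimizer side of this recipe: plain SGD with momentum 0.9 and a fixed multiplicative learning-rate decay of 0.94. The initial rate (0.1) and the decay interval (per epoch) are assumptions; the slides state only the momentum and decay factor.

```python
def lr_schedule(base_lr, epoch, decay=0.94):
    """Learning rate after `epoch` multiplicative decay steps."""
    return base_lr * decay ** epoch

class MomentumSGD:
    """SGD with classical momentum: v <- m*v - lr*g, w <- w + v."""

    def __init__(self, lr, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = {}  # per-parameter velocity

    def step(self, params, grads):
        for name, g in grads.items():
            v = self.momentum * self.v.get(name, 0.0) - self.lr * g
            self.v[name] = v
            params[name] += v

# One toy parameter, two updates with a constant gradient:
opt = MomentumSGD(lr=lr_schedule(0.1, 0))
params, grads = {"w": 1.0}, {"w": 0.5}
opt.step(params, grads)  # w: 1.0 -> 0.95
opt.step(params, grads)  # w: 0.95 -> 0.855 (momentum accumulates)
```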

Task Classification of images into 10 different classes: Bedroom Bridge Church Outdoor Classroom Conference Room Dining Room Kitchen Living Room Restaurant Tower

Manual Score Calibration
- Compute per-label weights that maximize the score on one half of the validation set.
- Cross-validate on the other half of the validation set.
- Simplify the weights after error minimization to avoid overfitting to the validation set.
Final score multipliers: 4.0 for church outdoor, 2.0 for conference room. Probable reason: these classes are under-represented in the training set.
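At inference time, this calibration amounts to multiplying each class's score by its weight before taking the argmax. The multipliers 4.0 and 2.0 come from the slide; the example scores below are invented for illustration:

```python
# Per-class score multipliers from the slide; all other classes keep 1.0.
MULTIPLIERS = {"church_outdoor": 4.0, "conference_room": 2.0}

def calibrated_prediction(scores):
    """scores: dict mapping class name -> raw model score.
    Returns the argmax after applying the calibration multipliers."""
    calibrated = {c: s * MULTIPLIERS.get(c, 1.0) for c, s in scores.items()}
    return max(calibrated, key=calibrated.get)

# Made-up scores: uncalibrated, "tower" would win; after boosting the
# under-represented "church_outdoor" class, it wins instead.
scores = {c: 0.05 for c in ["bedroom", "bridge", "church_outdoor",
                            "classroom", "conference_room", "dining_room",
                            "kitchen", "living_room", "restaurant", "tower"]}
scores["tower"] = 0.30
scores["church_outdoor"] = 0.15
print(calibrated_prediction(scores))  # church_outdoor (0.15 * 4.0 > 0.30)
```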

Evaluation
Crop averaging at 3 different scales (Going Deeper with Convolutions, Szegedy et al., CVPR 2015): score averaging over 144 crops per image.
  Evaluation method         Accuracy (on validation set)
  Single crop               89.2%
  Multi crop                89.7%
  Manual score calibration  91.2%
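Multi-crop evaluation averages per-class scores over all crops of an image and predicts the class with the highest mean. The 144-crop scheme of the GoogLeNet paper (multiple scales, positions and mirrors) is reduced here to a generic score-averaging helper; the crop scores below are made up:

```python
def average_crop_scores(crop_scores):
    """crop_scores: list of per-crop score lists (num_crops x num_classes).
    Returns the per-class mean used for the final prediction."""
    n = len(crop_scores)
    return [sum(col) / n for col in zip(*crop_scores)]

def predict(crop_scores):
    """Index of the highest-scoring class after crop averaging."""
    avg = average_crop_scores(crop_scores)
    return max(range(len(avg)), key=avg.__getitem__)

# Three dummy crops over four classes: the noisy third crop alone would
# predict class 0, but averaging over crops picks class 1.
crops = [[0.1, 0.7, 0.1, 0.1],
         [0.2, 0.5, 0.2, 0.1],
         [0.6, 0.1, 0.2, 0.1]]
print(predict(crops))  # 1
```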

Releasing Pretrained Inception and MultiBox
Academic criticism: results are hard to reproduce. We will be releasing pretrained Caffe models for:
- GoogLeNet (Inception 5)
- BN-Inception (Inception 6)
- MultiBox-Inception proposal generator (based on Inception 6)
Contact: Yangqing Jia

Acknowledgments
We would like to thank the organizers of LSUN, and the DistBelief and Image Annotation teams at Google for their support of the machine learning and evaluation infrastructure.