CS 7643: Deep Learning


CS 7643: Deep Learning
Topics:
- Stride, padding
- Pooling layers
- Fully-connected layers as convolutions
- Backprop in conv layers
Dhruv Batra Georgia Tech

Invited Talks: Sumit Chopra on CNNs for Pixel Labeling. Head of AI Research @ Imagen Technologies; previously Facebook AI Research. Tue 09/26, in class. (C) Dhruv Batra 2

Administrativia
- HW1 due soon: 09/22
- HW2 + PS2 both coming out on 09/22
- Note on class schedule coming up: switching to paper reading starting next week. https://docs.google.com/spreadsheets/d/1un31ycwag6nhjvYPUVKMy3vHwW-h9MZCe8yKCqw0RsU/edit#gid=0
- First review due: Tue 09/26
- First student presentation due: Thu 09/28
(C) Dhruv Batra 3

Paper Reading Intuition: Multi-Task Learning [figure: a multi-task network with Shared Layers 1...N over common data, and separate Task Layers feeding each of Paper 1 ... Paper 6, with the Paper of the Day as one more task] (C) Dhruv Batra 4

CS 7643: Deep Learning. www.cc.gatech.edu/classes/ay2018/cs7643_fall/ piazza.com/gatech/fall2017/cs7643 Canvas: gatech.instructure.com/courses/772 Lectures. Dhruv Batra, School of Interactive Computing, Georgia Tech (C) Dhruv Batra 8

Paper Reviews
- Length: 200-400 words. Due: midnight before class, on Piazza.
- Organization:
  - Summary: What is this paper about? What is the main contribution? Describe the main approach & results. Just facts, no opinions yet.
  - List of positive points / Strengths: Is there a new theoretical insight? Or a significant empirical advance? Did they solve a standing open problem? Or is it a good formulation for a new problem? Or a faster/better solution for an existing problem? Any good practical outcome (code, algorithm, etc.)? Are the experiments well executed? Useful for the community in general?
  - List of negative points / Weaknesses: What would you do differently? Any missing baselines? Missing datasets? Any odd design choices in the algorithm not explained well? Quality of writing? Is there sufficient novelty in what they propose? Has it already been done? Minor variation of previous work? Why should anyone care? Is the problem interesting and significant?
  - Reflections: How does this relate to other papers we have read? What are the next research directions in this line of work?
(C) Dhruv Batra 5

Presentations
- Frequency: once in the semester; 5 min presentation.
- Expectations:
  - Present details of 1 paper: describe formulation, experiments, approaches, datasets.
  - Encouraged to present a broad picture.
  - Show results, videos, gifs, etc.
  - Please clearly cite the source of each slide that is not your own.
  - Meet with TA 1 week before class to dry-run the presentation; worth 40% of presentation grade.
(C) Dhruv Batra 6

Administrativia: Project Teams Google Doc https://docs.google.com/spreadsheets/d/1aaxy0je4labhvoDaWlc9zsmfKMyuGS39JAn9dpeXhhQ/edit#gid=0
- Project title
- 1-3 sentence project summary (TL;DR)
- Team member names + GT IDs
(C) Dhruv Batra 7

Recap of last time (C) Dhruv Batra 8

Patterns in backward flow Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Patterns in backward flow add gate: gradient distributor Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Patterns in backward flow add gate: gradient distributor Q: What is a max gate? Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Patterns in backward flow add gate: gradient distributor max gate: gradient router Q: What is a mul gate? Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Patterns in backward flow add gate: gradient distributor max gate: gradient router mul gate: gradient switcher Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
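These three patterns are easy to verify numerically. A minimal PyTorch sketch (ours, not from the slides; variable names hypothetical):

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
y = torch.tensor(5.0, requires_grad=True)

(x + y).backward()          # add gate: distributes the upstream gradient to both inputs
print(x.grad, y.grad)       # tensor(1.) tensor(1.)

x.grad = None; y.grad = None
torch.max(x, y).backward()  # max gate: routes the gradient to the larger input only
print(x.grad, y.grad)       # tensor(0.) tensor(1.)

x.grad = None; y.grad = None
(x * y).backward()          # mul gate: "switches" the inputs: dL/dx = y, dL/dy = x
print(x.grad, y.grad)       # tensor(5.) tensor(3.)
```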

Duality in Fprop and Bprop [figure: a SUM node in FPROP becomes a COPY in BPROP, and a COPY in FPROP becomes a SUM in BPROP] (C) Dhruv Batra 14

Key Computation in DL: Forward-Prop (C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato, Yann LeCun 15

Jacobian of ReLU 4096-d input vector Q: what is the size of the Jacobian matrix? f(x) = max(0,x) (elementwise) 4096-d output vector 17 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
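The slide's question has a tidy answer: for an elementwise ReLU on a 4096-d vector, the Jacobian is 4096x4096 but diagonal, with a 1 wherever the input is positive, so backprop never materializes it. A small numpy sketch of this fact (names ours):

```python
import numpy as np

n = 4096
x = np.random.randn(n)

# Jacobian of f(x) = max(0, x) applied elementwise: n x n and diagonal,
# with J[i, i] = 1 where x[i] > 0 and 0 otherwise.
J = np.diag((x > 0).astype(float))
print(J.shape)  # (4096, 4096)

# Backprop never builds J; it just masks the upstream gradient:
upstream = np.random.randn(n)
assert np.allclose(J @ upstream, upstream * (x > 0))
```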

Jacobians of FC-Layer (C) Dhruv Batra 18

Convolutional Neural Networks (without the brain stuff) Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Fully Connected Layer. Example: 200x200 image, 40K hidden units => ~2B parameters!!! - Spatial correlation is local - Waste of resources + we don't have enough training samples anyway... Slide Credit: Marc'Aurelio Ranzato 21

Locally Connected Layer Example: 200x200 image 40K hidden units Filter size: 10x10 4M parameters Note: This parameterization is good when input image is registered (e.g., face recognition). Slide Credit: Marc'Aurelio Ranzato 22

Locally Connected Layer STATIONARITY? Statistics is similar at different locations Example: 200x200 image 40K hidden units Filter size: 10x10 4M parameters Note: This parameterization is good when input image is registered (e.g., face recognition). Slide Credit: Marc'Aurelio Ranzato 23

Convolutional Layer Share the same parameters across different locations (assuming input is stationary): Convolutions with learned kernels Slide Credit: Marc'Aurelio Ranzato 24

Convolutions for mathematicians (C) Dhruv Batra 25

"Convolution of box signal with itself2" by Convolution_of_box_signal_with_itself.gif: Brian Ambergderivative work: Tinos (talk) - Convolution_of_box_signal_with_itself.gif. Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/file:convolution_of_box_signal_with_itself2.gif#/media/file:convolution_of_box_signal_wi th_itself2.gif (C) Dhruv Batra 26

Convolutions for computer scientists (C) Dhruv Batra 27

Convolutions for programmers (C) Dhruv Batra 28

Convolution Explained http://setosa.io/ev/image-kernels/ https://github.com/bruckner/deepviz (C) Dhruv Batra 29
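Before the worked examples below, a minimal numpy sketch of the sliding-window computation itself (strictly speaking cross-correlation, which is what CNN "convolution" layers compute; the function name is ours):

```python
import numpy as np

def conv2d(image, kernel):
    """'Valid' 2D cross-correlation of a single-channel image with a square kernel."""
    H, W = image.shape
    F, _ = kernel.shape
    out = np.zeros((H - F + 1, W - F + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            # Dot product between the kernel and the patch under it.
            out[r, c] = np.sum(image[r:r+F, c:c+F] * kernel)
    return out

edge_kernel = np.array([[-1., 0., 1.]] * 3)  # the vertical-edge filter from a later slide
print(conv2d(np.random.rand(7, 7), edge_kernel).shape)  # (5, 5)
```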

Plan for Today
- Convolutional Neural Networks
  - Stride, padding
  - Pooling layers
  - Fully-connected layers as convolutions
  - Backprop in conv layers
(C) Dhruv Batra 30

Convolutional Layer [animation, over several slides: a learned kernel slides across the input, producing one output value per location] (C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato 31-45

Convolutional Layer Mathieu et al. Fast training of CNNs through FFTs ICLR 2014 (C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato 46

Convolutional Layer [figure: an image convolved (*) with the kernel
-1 0 1
-1 0 1
-1 0 1
= an output map highlighting vertical edges] (C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato 47

Convolutional Layer Learn multiple filters. E.g.: 200x200 image 100 Filters Filter size: 10x10 10K parameters (C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato 48

Fully Connected Layer 32x32x3 image -> stretch to 3072 x 1 input activation 1 3072 10 x 3072 weights 1 10 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Fully Connected Layer 32x32x3 image -> stretch to 3072 x 1 input activation 1 3072 10 x 3072 weights 1 10 1 number: the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product) Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Convolutional Neural Networks [LeNet-5: INPUT 32x32 -> C1: feature maps 6@28x28 -> S2: f. maps 6@14x14 -> C3: f. maps 16@10x10 -> S4: f. maps 16@5x5 -> C5: layer 120 -> F6: layer 84 -> OUTPUT 10; convolutions and subsampling alternate, followed by full connections and Gaussian connections] (C) Dhruv Batra Image Credit: Yann LeCun, Kevin Murphy 51

FC vs Conv Layer 52

Convolution Layer 32x32x3 image -> preserve spatial structure 32 height 3 32 depth width Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Convolution Layer 32x32x3 image 5x5x3 filter 32 Convolve the filter with the image i.e. slide over the image spatially, computing dot products 3 32 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Convolution Layer 32x32x3 image Filters always extend the full depth of the input volume 5x5x3 filter 32 Convolve the filter with the image i.e. slide over the image spatially, computing dot products 3 32 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Convolution Layer 32 32x32x3 image 5x5x3 filter 3 32 1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. 5*5*3 = 75-dimensional dot product + bias) Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Convolution Layer 32 32x32x3 image 5x5x3 filter activation map 28 convolve (slide) over all spatial locations 3 32 1 28 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Convolution Layer consider a second, green filter 32 32x32x3 image 5x5x3 filter activation maps 28 convolve (slide) over all spatial locations 3 32 1 28 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Convolution Layer. For example, if we had 6 5x5 filters, we'll get 6 separate activation maps (each 28x28). We stack these up to get a new image of size 28x28x6! Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
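A quick shape check of this stacking, sketched in PyTorch (our framework choice, not the slides'):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)          # one 32x32x3 image (NCHW layout)
conv = nn.Conv2d(3, 6, kernel_size=5)  # 6 filters, each 5x5x3
print(conv(x).shape)                   # torch.Size([1, 6, 28, 28]): the 28x28x6 "new image"
```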

Preview: ConvNet is a sequence of Convolution Layers, interspersed with activation functions 32 28 3 32 CONV, ReLU e.g. 6 5x5x3 filters 6 28 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Preview: ConvNet is a sequence of Convolutional Layers, interspersed with activation functions 32 28 24 3 32 CONV, ReLU e.g. 6 5x5x3 filters 28 6 CONV, ReLU e.g. 10 5x5x6 filters 10 24 CONV, ReLU. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Preview [Zeiler and Fergus 2013] Visualization of VGG-16 by Lane McIntosh. VGG-16 architecture from [Simonyan and Zisserman 2014]. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

one filter => one activation map example 5x5 filters (32 total) Figure copyright Andrej Karpathy. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Convolutional Neural Networks [LeNet-5: INPUT 32x32 -> C1: feature maps 6@28x28 -> S2: f. maps 6@14x14 -> C3: f. maps 16@10x10 -> S4: f. maps 16@5x5 -> C5: layer 120 -> F6: layer 84 -> OUTPUT 10; convolutions and subsampling alternate, followed by full connections and Gaussian connections] (C) Dhruv Batra Image Credit: Yann LeCun, Kevin Murphy 64

preview: Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

A closer look at spatial dimensions: 32 32x32x3 image 5x5x3 filter activation map 28 convolve (slide) over all spatial locations 3 32 1 28 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

A closer look at spatial dimensions: 7 7x7 input (spatially) assume 3x3 filter 7 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

A closer look at spatial dimensions: 7 7x7 input (spatially) assume 3x3 filter 7 => 5x5 output Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

A closer look at spatial dimensions: 7 7x7 input (spatially) assume 3x3 filter applied with stride 2 7 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

A closer look at spatial dimensions: 7 7 7x7 input (spatially) assume 3x3 filter applied with stride 2 => 3x3 output! Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

A closer look at spatial dimensions: 7 7x7 input (spatially) assume 3x3 filter applied with stride 3? 7 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

A closer look at spatial dimensions: 7x7 input (spatially), assume 3x3 filter applied with stride 3? Doesn't fit! Cannot apply a 3x3 filter on a 7x7 input with stride 3. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Output size: (N - F) / stride + 1
e.g. N = 7, F = 3:
stride 1 => (7 - 3)/1 + 1 = 5
stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33 :\
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
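A helper capturing this arithmetic (the function name, and the pad argument anticipating the zero-padding slides just below, are ours):

```python
def conv_output_size(n, f, stride, pad=0):
    """Spatial output size of a conv layer: (N - F + 2*P) / stride + 1."""
    return (n - f + 2 * pad) / stride + 1

print(conv_output_size(7, 3, 1))         # 5.0
print(conv_output_size(7, 3, 2))         # 3.0
print(conv_output_size(7, 3, 3))         # 2.33... -> not an integer: stride 3 doesn't fit
print(conv_output_size(7, 3, 1, pad=1))  # 7.0: padding with (F-1)/2 preserves size
```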

In practice: Common to zero pad the border. e.g. input 7x7, 3x3 filter applied with stride 1, pad with 1 pixel border => what is the output? (recall: (N - F) / stride + 1) Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

In practice: Common to zero pad the border. e.g. input 7x7, 3x3 filter applied with stride 1, pad with 1 pixel border => what is the output? 7x7 output! Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

In practice: Common to zero pad the border. e.g. input 7x7, 3x3 filter applied with stride 1, pad with 1 pixel border => 7x7 output! In general, common to see CONV layers with stride 1, filters of size FxF, and zero-padding with (F-1)/2 (will preserve size spatially). e.g. F = 3 => zero pad with 1; F = 5 => zero pad with 2; F = 7 => zero pad with 3. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Remember back: e.g. a 32x32 input convolved repeatedly with 5x5 filters shrinks volumes spatially! (32 -> 28 -> 24 ...). Shrinking too fast is not good; it doesn't work well. [figure: 32x32x3 -> CONV, ReLU (6 5x5x3 filters) -> 28x28x6 -> CONV, ReLU (10 5x5x6 filters) -> 24x24x10 -> CONV, ReLU ...] Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Examples time: Input volume: 32x32x3 10 5x5 filters with stride 1, pad 2 Output volume size:? Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Examples time: Input volume: 32x32x3 10 5x5 filters with stride 1, pad 2 Output volume size: (32+2*2-5)/1+1 = 32 spatially, so 32x32x10 Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Examples time: Input volume: 32x32x3 10 5x5 filters with stride 1, pad 2 Number of parameters in this layer? Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Examples time: Input volume: 32x32x3, 10 5x5 filters with stride 1, pad 2. Number of parameters in this layer? Each filter has 5*5*3 + 1 = 76 params (the +1 is the bias) => 76*10 = 760. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
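The same count, checked in PyTorch (our own sketch):

```python
import torch.nn as nn

# 10 filters of size 5x5 on a 3-channel input, stride 1, pad 2 (as in the example).
conv = nn.Conv2d(in_channels=3, out_channels=10, kernel_size=5, stride=1, padding=2)
print(sum(p.numel() for p in conv.parameters()))  # 10*(5*5*3 + 1) = 760
```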

Common settings:
- K = powers of 2, e.g. 32, 64, 128, 512
- F = 3, S = 1, P = 1
- F = 5, S = 1, P = 2
- F = 5, S = 2, P = ? (whatever fits)
- F = 1, S = 1, P = 0
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

(btw, 1x1 convolution layers make perfect sense) [figure: 56x56x64 input -> 1x1 CONV with 32 filters -> 56x56x32 output] (each filter has size 1x1x64, and performs a 64-dimensional dot product) Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
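A shape check of the 1x1 case, again sketched in PyTorch:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)              # 56x56x64 input volume
conv1x1 = nn.Conv2d(64, 32, kernel_size=1)  # 32 filters, each 1x1x64
print(conv1x1(x).shape)                     # torch.Size([1, 32, 56, 56])
```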

Example: CONV layer in Torch Torch is licensed under BSD 3-clause. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Example: CONV layer in Caffe Caffe is licensed under BSD 2-Clause. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

The brain/neuron view of CONV Layer. An activation map is a 28x28 sheet of neuron outputs: 1. Each is connected to a small region in the input. 2. All of them share parameters. 5x5 filter -> 5x5 receptive field for each neuron. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Reminder: Fully Connected Layer 32x32x3 image -> stretch to 3072 x 1 input Each neuron looks at the full input volume activation 1 3072 10 x 3072 weights 1 10 1 number: the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product) Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

two more layers to go: POOL/FC Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Pooling Layer. Let us assume the filter is an eye detector. Q.: how can we make the detection robust to the exact location of the eye? (C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato 94

Pooling Layer By pooling (e.g., taking max) filter responses at different locations we gain robustness to the exact spatial location of features. (C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato 95

Pooling layer - makes the representations smaller and more manageable - operates over each activation map independently: Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

MAX POOLING. Single depth slice:
1 1 2 4
5 6 7 8
3 2 1 0
1 2 3 4
max pool with 2x2 filters and stride 2 =>
6 8
3 4
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
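A numpy sketch reproducing the slide's example (the reshape trick is ours):

```python
import numpy as np

def max_pool_2x2(x):
    """Max pooling with 2x2 windows and stride 2 over one depth slice."""
    h, w = x.shape
    # Split into 2x2 blocks, then take the max within each block.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])
print(max_pool_2x2(x))  # [[6 8]
                        #  [3 4]]
```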

Pooling Layer: Examples
Max-pooling: $h_i^n(r,c) = \max_{\bar{r} \in N(r),\, \bar{c} \in N(c)} h_i^{n-1}(\bar{r},\bar{c})$
Average-pooling: $h_i^n(r,c) = \operatorname{mean}_{\bar{r} \in N(r),\, \bar{c} \in N(c)} h_i^{n-1}(\bar{r},\bar{c})$
L2-pooling: $h_i^n(r,c) = \sqrt{\sum_{\bar{r} \in N(r),\, \bar{c} \in N(c)} h_i^{n-1}(\bar{r},\bar{c})^2}$
L2-pooling over features: $h_i^n(r,c) = \sqrt{\sum_{j \in N(i)} h_j^{n-1}(r,c)^2}$
(C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato 98

Common settings:
- F = 2, S = 2
- F = 3, S = 2
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Pooling Layer: Receptive Field Size [figure: h^{n-1} -> Conv. layer -> h^n -> Pool. layer] If convolutional filters have size KxK and stride 1, and the pooling layer has pools of size PxP, then each unit in the pooling layer depends upon a patch (at the input of the preceding conv. layer) of size (P+K-1)x(P+K-1). E.g., K = 3 and P = 2 => each pooling unit sees a 4x4 input patch. (C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato 101

Fully Connected Layer (FC layer) - Contains neurons that connect to the entire input volume, as in ordinary Neural Networks Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Convolutional Nets Example: http://yann.lecun.com/exdb/lenet/index.html [LeNet-5: INPUT 32x32 -> C1: feature maps 6@28x28 -> S2: f. maps 6@14x14 -> C3: f. maps 16@10x10 -> S4: f. maps 16@5x5 -> C5: layer 120 -> F6: layer 84 -> OUTPUT 10; convolutions and subsampling alternate, followed by full connections and Gaussian connections] (C) Dhruv Batra Image Credit: Yann LeCun, Kevin Murphy 104
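A rough PyTorch rendering of that pipeline (our sketch; ReLU and max-pooling stand in for LeNet-5's original squashing and subsampling units):

```python
import torch
import torch.nn as nn

lenet = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5), nn.ReLU(),     # C1: 6@28x28
    nn.MaxPool2d(2),                               # S2: 6@14x14
    nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(),    # C3: 16@10x10
    nn.MaxPool2d(2),                               # S4: 16@5x5
    nn.Conv2d(16, 120, kernel_size=5), nn.ReLU(),  # C5: 120 (a 5x5 conv acts as FC here)
    nn.Flatten(),
    nn.Linear(120, 84), nn.ReLU(),                 # F6: 84
    nn.Linear(84, 10),                             # OUTPUT: 10
)
print(lenet(torch.randn(1, 1, 32, 32)).shape)      # torch.Size([1, 10])
```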

Note: After several stages of convolution-pooling, the spatial resolution is greatly reduced (usually to about 5x5) and the number of feature maps is large (several hundreds depending on the application). It would not make sense to convolve again (there is no translation invariance and support is too small). Everything is vectorized and fed into several fully connected layers. If the input of the fully connected layers is of size 5x5xN, the first fully connected layer can be seen as a conv. layer with 5x5 kernels. The next fully connected layer can be seen as a conv. layer with 1x1 kernels. (C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato 105
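A sketch of that equivalence in PyTorch (channel/unit counts hypothetical):

```python
import torch
import torch.nn as nn

N, H, K = 256, 512, 10                        # hypothetical sizes: 5x5xN input, H then K units
fc1_as_conv = nn.Conv2d(N, H, kernel_size=5)  # == Linear(5*5*N, H) on the flattened volume
fc2_as_conv = nn.Conv2d(H, K, kernel_size=1)  # == Linear(H, K)

x = torch.randn(1, N, 5, 5)                   # the 5x5xN volume after conv-pooling stages
print(fc2_as_conv(fc1_as_conv(x)).shape)      # torch.Size([1, 10, 1, 1])

# The payoff (next slides): the same net runs on a bigger input,
# producing a spatial map of outputs instead of a single vector.
x_big = torch.randn(1, N, 14, 14)
print(fc2_as_conv(fc1_as_conv(x_big)).shape)  # torch.Size([1, 10, 10, 10])
```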

Classical View (C) Dhruv Batra Figure Credit: [Long, Shelhamer, Darrell CVPR15] 106

H hidden units MxMxN, M small Fully conn. layer (C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato 107

Classical View = Inefficient (C) Dhruv Batra 108

Classical View (C) Dhruv Batra Figure Credit: [Long, Shelhamer, Darrell CVPR15] 109

Re-interpretation Just squint a little! (C) Dhruv Batra Figure Credit: [Long, Shelhamer, Darrell CVPR15] 110

Fully Convolutional Networks Can run on an image of any size! (C) Dhruv Batra Figure Credit: [Long, Shelhamer, Darrell CVPR15] 111

H hidden units / 1x1xH feature maps MxMxN, M small Fully conn. layer / Conv. layer (H kernels of size MxMxN) (C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato 112

K hidden units / 1x1xK feature maps H hidden units / 1x1xH feature maps MxMxN, M small Fully conn. layer / Conv. layer (H kernels of size MxMxN) Fully conn. layer / Conv. layer (K kernels of size 1x1xH) (C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato 113

Viewing fully connected layers as convolutional layers enables efficient use of convnets on bigger images (no need to slide windows, but unroll the network over space as needed to re-use computation). [figure: TRAINING TIME: Input Image -> CNN; TEST TIME: larger Input Image -> CNN -> outputs over y, x] (C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato 114

Viewing fully connected layers as convolutional layers enables efficient use of convnets on bigger images (no need to slide windows, but unroll the network over space as needed to re-use computation). [figure: TRAINING TIME: Input Image -> CNN; TEST TIME: larger Input Image -> CNN -> outputs over y, x] CNNs work on any image size! Unrolling is orders of magnitude more efficient than sliding windows! (C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato 115

Re-interpretation Just squint a little! (C) Dhruv Batra Figure Credit: [Long, Shelhamer, Darrell CVPR15] 116

Fully Convolutional Networks Can run on an image of any size! (C) Dhruv Batra Figure Credit: [Long, Shelhamer, Darrell CVPR15] 117

Fully Convolutional Networks Up-sample to get segmentation maps (C) Dhruv Batra Figure Credit: [Long, Shelhamer, Darrell CVPR15] 118

Benefit of this thinking:
- Mathematically elegant
- Efficiency: can run the network on an arbitrary image without multiple crops
(C) Dhruv Batra 119

Summary
- ConvNets stack CONV, POOL, FC layers
- Trend towards smaller filters and deeper architectures
- Trend towards getting rid of POOL/FC layers (just CONV)
- Typical architectures look like [(CONV-RELU)*N-POOL?]*M-(FC-RELU)*K, SOFTMAX, where N is usually up to ~5, M is large, 0 <= K <= 2
- but recent advances such as ResNet/GoogLeNet challenge this paradigm
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n