CS 7643: Deep Learning

CS 7643: Deep Learning. Topics: Computational Graphs (notation + example); Computing Gradients (forward mode vs reverse mode AD). Dhruv Batra, Georgia Tech

Administrativia: HW1 released, due 09/22. PS1 solutions coming soon. (C) Dhruv Batra

Project. Goal: a chance to try deep learning. Combine with other classes / research / credits / anything; you have our blanket permission. Extra credit for shooting for a publication. Encouraged to apply to your research (computer vision, NLP, robotics, ...). Must be done this semester. Main categories: (1) Application/Survey: compare a bunch of existing algorithms on a new application domain of your interest. (2) Formulation/Development: formulate a new model or algorithm for a new or old problem. (3) Theory: theoretically analyze an existing algorithm.

Administrativia: Project Teams Google Doc: https://docs.google.com/spreadsheets/d/1aaxy0je4labhvo DaWlc9zsmfKMyuGS39JAn9dpeXhhQ/edit#gid=0. Fill in: project title; 1-3 sentence project summary (TL;DR); team member names + GT IDs.

Recap of last time (C) Dhruv Batra 5

How do we compute gradients? Manual differentiation; symbolic differentiation; numerical differentiation; automatic differentiation (forward mode AD; reverse mode AD, aka backprop).
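As a small illustration of the numerical option in this list, the sketch below approximates a gradient with central differences and compares it with a hand-derived (manual) gradient. The function is the running example used later in these slides; the helper names `manual_gradient` and `numerical_gradient` and the step size `h` are assumptions made for this sketch.

```python
import math

def f(x1, x2):
    # Running example from the slides: f(x1, x2) = x1*x2 + sin(x1)
    return x1 * x2 + math.sin(x1)

def manual_gradient(x1, x2):
    # Derived by hand: df/dx1 = x2 + cos(x1), df/dx2 = x1
    return (x2 + math.cos(x1), x1)

def numerical_gradient(func, x1, x2, h=1e-5):
    # Central differences: two function evaluations per input variable
    d1 = (func(x1 + h, x2) - func(x1 - h, x2)) / (2 * h)
    d2 = (func(x1, x2 + h) - func(x1, x2 - h)) / (2 * h)
    return (d1, d2)

exact = manual_gradient(2.0, 3.0)
approx = numerical_gradient(f, 2.0, 3.0)
```

The numerical result is only approximate and costs extra function evaluations per input, which is why it is usually reserved for checking gradients computed another way.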

Computational Graph: any DAG of differentiable modules is allowed! Slide Credit: Marc'Aurelio Ranzato

Directed Acyclic Graphs (DAGs): exactly what the name suggests. Directed edges; no (directed) cycles; underlying undirected cycles are okay.

Directed Acyclic Graphs (DAGs). Concept: topological ordering.
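A topological ordering lists every node after all of the nodes that feed into it. The sketch below uses Kahn's algorithm (one standard choice, not necessarily the one used in the lecture) on a small graph; the node names are hypothetical labels for the graph of x1*x2 + sin(x1).

```python
from collections import deque

def topological_order(nodes, edges):
    """Return the nodes so that every edge (src, dst) has src before dst."""
    indegree = {n: 0 for n in nodes}
    children = {n: [] for n in nodes}
    for src, dst in edges:
        children[src].append(dst)
        indegree[dst] += 1
    ready = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if len(order) != len(nodes):
        raise ValueError("graph has a directed cycle, so it is not a DAG")
    return order

nodes = ["x1", "x2", "sin", "mul", "add"]
edges = [("x1", "sin"), ("x1", "mul"), ("x2", "mul"),
         ("sin", "add"), ("mul", "add")]
order = topological_order(nodes, edges)
```

Note that x1, mul, add, sin contains an underlying undirected cycle, which is allowed; only directed cycles make the ordering impossible.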

Computational Graphs, Notation #1: $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$

Computational Graphs, Notation #2: $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$

Example: $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$. [Figure: computational graph in which $x_1$ feeds a $\sin(\cdot)$ node, $x_1$ and $x_2$ feed a $*$ node, and both results feed a $+$ node.]
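The graph for this example can be evaluated node by node in topological order. In the minimal sketch below, the intermediate names `w1`, `w2`, `w3` follow the labels that the later AD slides use for the sin, multiply and add nodes; the function name `forward_pass` is an assumption.

```python
import math

def forward_pass(x1, x2):
    w1 = math.sin(x1)   # sin node
    w2 = x1 * x2        # multiply node
    w3 = w1 + w2        # add node, the output f(x1, x2)
    return w3
```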

Logistic Regression as a Cascade: given a library of simple functions, compose them into a complicated function, $-\log\left(\frac{1}{1 + e^{-w^\top x}}\right)$, whose innermost module computes $w^\top x$. Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
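A minimal sketch of the cascade idea, assuming the loss is the negative log of a sigmoid applied to the dot product of w and x; the helper names (`dot`, `sigmoid`, `neg_log`, `logistic_loss`) are made up for illustration.

```python
import math

def dot(w, x):
    # Innermost module: the dot product of w and x
    return sum(wi * xi for wi, xi in zip(w, x))

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def neg_log(p):
    return -math.log(p)

def logistic_loss(w, x):
    # The "complicated" function is just a composition of the simple ones
    return neg_log(sigmoid(dot(w, x)))
```

Because every module is differentiable, the composition is a valid computational graph and can be differentiated module by module.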

Forward mode vs Reverse mode: Key Computations

Forward mode AD

Reverse mode AD

Example: Forward mode AD for $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$. With intermediate values $w_1 = \sin(x_1)$, $w_2 = x_1 x_2$ and $w_3 = w_1 + w_2$, the tangents $\dot{x}_1, \dot{x}_2$ are pushed from the inputs to the output:

$$\dot{w}_1 = \cos(x_1)\,\dot{x}_1, \qquad \dot{w}_2 = \dot{x}_1 x_2 + x_1 \dot{x}_2, \qquad \dot{w}_3 = \dot{w}_1 + \dot{w}_2$$
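The forward mode rules for this example can be sketched directly in code, carrying each tangent alongside its value. The function name `forward_mode` and the seed arguments `dx1`, `dx2` are assumptions for this sketch.

```python
import math

def forward_mode(x1, x2, dx1, dx2):
    """Return f(x1, x2) and its directional derivative along (dx1, dx2)."""
    w1 = math.sin(x1)
    dw1 = math.cos(x1) * dx1          # tangent of the sin node
    w2 = x1 * x2
    dw2 = dx1 * x2 + x1 * dx2         # tangent of the multiply node
    w3 = w1 + w2
    dw3 = dw1 + dw2                   # tangent of the add node
    return w3, dw3

# One pass per input: seed (1, 0) gives df/dx1, seed (0, 1) gives df/dx2
_, df_dx1 = forward_mode(2.0, 3.0, 1.0, 0.0)
_, df_dx2 = forward_mode(2.0, 3.0, 0.0, 1.0)
```

Tangents are computed together with the values, so nothing has to be stored for later, but the full gradient needs one pass per input.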

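Example: Reverse mode AD for $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$. With $w_1 = \sin(x_1)$, $w_2 = x_1 x_2$ and $w_3 = w_1 + w_2$, the adjoints are pulled from the output back to the inputs:

$$\bar{w}_3 = 1, \qquad \bar{w}_1 = \bar{w}_3, \qquad \bar{w}_2 = \bar{w}_3$$

$$\bar{x}_1 = \bar{w}_1 \cos(x_1) \ \text{(via } \sin\text{)}, \qquad \bar{x}_1 = \bar{w}_2 x_2 \ \text{(via } *\text{)}, \qquad \bar{x}_2 = \bar{w}_2 x_1$$

The two contributions to $\bar{x}_1$ are summed.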
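The reverse mode rules for this example can be sketched as a forward pass that stores the intermediates, followed by a backward pass over the nodes in reverse order. The function name `reverse_mode` and the `*_bar` variable names are assumptions for this sketch.

```python
import math

def reverse_mode(x1, x2):
    """Return f(x1, x2) and the full gradient (df/dx1, df/dx2)."""
    # Forward pass: compute the intermediate values
    w1 = math.sin(x1)
    w2 = x1 * x2
    w3 = w1 + w2
    # Backward pass: start from the output and walk the graph in reverse
    w3_bar = 1.0
    w1_bar = w3_bar                    # add node passes the adjoint through
    w2_bar = w3_bar
    x1_bar = w1_bar * math.cos(x1) + w2_bar * x2   # two paths into x1 are summed
    x2_bar = w2_bar * x1
    return w3, (x1_bar, x2_bar)

value, grad = reverse_mode(2.0, 3.0)
```

A single backward pass yields the derivative with respect to every input, at the cost of keeping the forward values around until they are used.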

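Forward Pass vs Forward mode AD vs Reverse mode AD, for $f(x_1, x_2) = x_1 x_2 + \sin(x_1)$:

Forward pass: $w_1 = \sin(x_1)$, $w_2 = x_1 x_2$, $w_3 = w_1 + w_2$.

Forward mode AD: $\dot{w}_1 = \cos(x_1)\,\dot{x}_1$, $\dot{w}_2 = \dot{x}_1 x_2 + x_1 \dot{x}_2$, $\dot{w}_3 = \dot{w}_1 + \dot{w}_2$.

Reverse mode AD: $\bar{w}_3 = 1$, $\bar{w}_1 = \bar{w}_3$, $\bar{w}_2 = \bar{w}_3$, $\bar{x}_1 = \bar{w}_1 \cos(x_1)$ (via $\sin$), $\bar{x}_1 = \bar{w}_2 x_2$ (via $*$), $\bar{x}_2 = \bar{w}_2 x_1$.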

Forward mode vs Reverse mode: What are the differences? Which one is more memory efficient (less storage), forward or backward? Which one is faster to compute, forward or backward?

Plan for Today: (Finish) Computing Gradients: forward mode vs reverse mode AD; patterns in backprop; backprop in FC+ReLU NNs. Convolutional Neural Networks.

Patterns in backward flow:
add gate: gradient distributor
max gate: gradient router
mul gate: gradient switcher
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
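The three patterns can be sketched as backward functions for two-input scalar gates. The function names are hypothetical, and sending the gradient to the first input when the max gate's inputs tie is a convention chosen for this sketch.

```python
def add_backward(x, y, upstream):
    # Distributor: both inputs receive the upstream gradient unchanged
    return upstream, upstream

def max_backward(x, y, upstream):
    # Router: only the larger input receives the gradient
    return (upstream, 0.0) if x >= y else (0.0, upstream)

def mul_backward(x, y, upstream):
    # Switcher: each input receives the upstream gradient scaled by the other input
    return upstream * y, upstream * x
```

For example, with x = 3, y = -4 and an upstream gradient of 2, the multiply gate sends 2 * (-4) = -8 to x and 2 * 3 = 6 to y.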

Gradients add at branches. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Duality in Fprop and Bprop: a SUM (+) in FPROP corresponds to a COPY in BPROP, and a COPY in FPROP corresponds to a SUM (+) in BPROP.
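A minimal sketch of this duality for scalars, with hypothetical function names: the backward of a sum copies the incoming gradient to both inputs, and the backward of a copy (a branch) sums the gradients arriving from both consumers.

```python
def sum_forward(a, b):
    return a + b

def sum_backward(upstream):
    # SUM in the forward pass becomes COPY in the backward pass
    return upstream, upstream

def copy_forward(x):
    return x, x

def copy_backward(upstream_1, upstream_2):
    # COPY in the forward pass becomes SUM in the backward pass
    return upstream_1 + upstream_2
```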

Modularized implementation: forward / backward API. Graph (or Net) object (rough pseudocode). Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
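The slide's pseudocode is not preserved in this transcript, so the following is only a sketch of what such a Graph object might look like: forward visits the gates in topological order and backward visits them in reverse. `LoggingGate` is a hypothetical stand-in that only records when it is called.

```python
class LoggingGate:
    """Hypothetical gate that records calls instead of doing real work."""
    def __init__(self, name, log):
        self.name = name
        self.log = log

    def forward(self):
        self.log.append(("forward", self.name))

    def backward(self):
        self.log.append(("backward", self.name))

class Graph:
    def __init__(self, gates_in_topological_order):
        # Assumes the caller already sorted the gates topologically
        self.gates = list(gates_in_topological_order)

    def forward(self):
        for gate in self.gates:
            gate.forward()

    def backward(self):
        for gate in reversed(self.gates):
            gate.backward()

log = []
graph = Graph([LoggingGate(name, log) for name in ("sin", "mul", "add")])
graph.forward()
graph.backward()
```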

Modularized implementation: forward / backward API. Example: a multiply gate, z = x * y (x, y, z are scalars). Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
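A multiply gate with this forward / backward API might be sketched as follows. The class and attribute names are assumptions; the key point is that forward caches its inputs because backward needs them.

```python
class MultiplyGate:
    """Scalar gate computing z = x * y."""

    def forward(self, x, y):
        # Cache the inputs: the local gradients depend on them
        self.x = x
        self.y = y
        return x * y

    def backward(self, dz):
        # Chain rule: upstream gradient times local gradient
        dx = self.y * dz
        dy = self.x * dz
        return dx, dy

gate = MultiplyGate()
z = gate.forward(3.0, -4.0)
dx, dy = gate.backward(2.0)
```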

Example: Caffe layers. Caffe is licensed under BSD 2-Clause. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Caffe Sigmoid Layer: the backward pass multiplies the local sigmoid gradient by top_diff (chain rule). Caffe is licensed under BSD 2-Clause. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
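The following is a Python sketch of the same idea, not Caffe's actual C++ source. The name `top_diff` comes from the slide; `bottom_diff` and the class layout are assumptions for illustration.

```python
import math

class SigmoidLayer:
    def forward(self, bottom):
        # Elementwise sigmoid; keep the outputs for the backward pass
        self.top = [1.0 / (1.0 + math.exp(-v)) for v in bottom]
        return self.top

    def backward(self, top_diff):
        # Local gradient s * (1 - s), multiplied by top_diff (chain rule)
        bottom_diff = [d * s * (1.0 - s) for d, s in zip(top_diff, self.top)]
        return bottom_diff

layer = SigmoidLayer()
top = layer.forward([0.0])
bottom_diff = layer.backward([2.0])
```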

Key Computation in DL: Forward-Prop (C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato, Yann LeCun 43

Key Computation in DL: Back-Prop (C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato, Yann LeCun 44

Jacobian of ReLU: 4096-d input vector, f(x) = max(0, x) (elementwise), 4096-d output vector.
Q: What is the size of the Jacobian matrix? [4096 x 4096!]
In practice we process an entire minibatch (e.g. 100) of examples at one time, i.e. the Jacobian would technically be a [409,600 x 409,600] matrix :\
Q2: What does it look like?
Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
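One way to explore Q2 is to build the Jacobian explicitly for a tiny input. Because ReLU acts elementwise, output i depends only on input i, so the matrix is diagonal with entries 1 (where x > 0) or 0. The function names are hypothetical, and treating the derivative at exactly x = 0 as 0 is a convention assumed here.

```python
def relu(x):
    return [max(0.0, v) for v in x]

def relu_jacobian(x):
    # Entry (i, j) is d out_i / d x_j: nonzero only on the diagonal
    n = len(x)
    return [[1.0 if (i == j and x[i] > 0) else 0.0 for j in range(n)]
            for i in range(n)]

def relu_backward(x, upstream):
    # Same result as multiplying by the Jacobian, without ever forming it
    return [u if v > 0 else 0.0 for v, u in zip(x, upstream)]

x = [1.5, -2.0, 0.3]
J = relu_jacobian(x)
upstream = [10.0, 20.0, 30.0]
via_matrix = [sum(J[i][j] * upstream[i] for i in range(3)) for j in range(3)]
```

This is why implementations never materialize the 4096 x 4096 matrix: an elementwise mask gives the same gradient.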

Jacobians of FC-Layer (C) Dhruv Batra 50

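The slide content for the FC-layer Jacobians is not preserved in this transcript. As a hedged sketch of the standard result for a bias-free layer y = W x: the Jacobian of y with respect to x is W itself, so the gradient with respect to x is W transposed times the upstream gradient, and the gradient with respect to W is the outer product of the upstream gradient and x. The function names are assumptions.

```python
def fc_forward(W, x):
    # y_i = sum_j W[i][j] * x[j]
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def fc_backward(W, x, dy):
    n_out, n_in = len(W), len(x)
    # dL/dx = W^T dy
    dx = [sum(W[i][j] * dy[i] for i in range(n_out)) for j in range(n_in)]
    # dL/dW = outer(dy, x)
    dW = [[dy[i] * x[j] for j in range(n_in)] for i in range(n_out)]
    return dx, dW

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
x = [1.0, -1.0]
y = fc_forward(W, x)
dx, dW = fc_backward(W, x, [1.0, 0.0, 2.0])
```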

Convolutional Neural Networks (without the brain stuff) Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Fully Connected Layer. Example: 200x200 image, 40K hidden units, ~2B parameters!!! Spatial correlation is local. Waste of resources, and we do not have enough training samples anyway. Slide Credit: Marc'Aurelio Ranzato

Locally Connected Layer. Example: 200x200 image, 40K hidden units, filter size 10x10, 4M parameters. STATIONARITY? Statistics are similar at different locations. Note: this parameterization is good when the input image is registered (e.g., face recognition). Slide Credit: Marc'Aurelio Ranzato

Convolutional Layer: share the same parameters across different locations (assuming the input is stationary), i.e. convolutions with learned kernels. Slide Credit: Marc'Aurelio Ranzato

Convolutions for mathematicians

"Convolution of box signal with itself2" by Convolution_of_box_signal_with_itself.gif: Brian Amberg; derivative work: Tinos (talk) - Convolution_of_box_signal_with_itself.gif. Licensed under CC BY-SA 3.0 via Commons - https://commons.wikimedia.org/wiki/file:convolution_of_box_signal_with_itself2.gif#/media/file:convolution_of_box_signal_with_itself2.gif

Convolutions for computer scientists (C) Dhruv Batra 60

Convolutions for programmers (C) Dhruv Batra 61
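From a programmer's point of view, a discrete 1D convolution is a pair of nested loops. The sketch below (function name assumed) computes the full convolution and applies it to a box signal convolved with itself, the case shown in the animation credited earlier in these slides.

```python
def convolve_1d(f, g):
    """Full discrete convolution: out[n] = sum over k of f[k] * g[n - k]."""
    out = [0.0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            out[i + j] += fi * gj
    return out

box = [1.0, 1.0, 1.0]
triangle = convolve_1d(box, box)
```

Convolving the box with itself produces a triangle-shaped signal, rising to a peak where the two boxes overlap completely.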

Convolution Explained http://setosa.io/ev/image-kernels/ https://github.com/bruckner/deepviz (C) Dhruv Batra 62

Convolutional Layer (C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato 63

Convolutional Layer Mathieu et al. Fast training of CNNs through FFTs ICLR 2014 (C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato 78

Convolutional Layer: [input image] * [[-1 0 1], [-1 0 1], [-1 0 1]] = [filtered image]. Slide Credit: Marc'Aurelio Ranzato
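To see what a kernel with rows -1 0 1 responds to, the sketch below slides it over a tiny image with a dark left half and a bright right half. It assumes the sliding dot product without kernel flipping (cross-correlation), which is what deep learning layers usually call convolution; the function name and the toy image are made up for illustration.

```python
def correlate_2d(image, kernel):
    """Valid sliding-window dot product of a 2D kernel over a 2D image."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

kernel = [[-1, 0, 1],
          [-1, 0, 1],
          [-1, 0, 1]]
# 4x5 toy image: two dark columns followed by three bright columns
image = [[0, 0, 1, 1, 1] for _ in range(4)]
response = correlate_2d(image, kernel)
```

The response is large where the window straddles the dark-to-bright transition and zero in the flat bright region, so this kernel acts as a vertical edge detector.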

Convolutional Layer: learn multiple filters. E.g.: 200x200 image, 100 filters, filter size 10x10, 10K parameters. Slide Credit: Marc'Aurelio Ranzato
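The parameter counts quoted on the fully connected, locally connected and convolutional slides can be checked with a few lines of arithmetic. Biases are ignored, as the slide numbers appear to do, and the function names are hypothetical.

```python
def fully_connected_params(n_inputs, n_hidden):
    # Every hidden unit connects to every input pixel
    return n_inputs * n_hidden

def locally_connected_params(n_hidden, filter_h, filter_w):
    # Every hidden unit has its own private filter
    return n_hidden * filter_h * filter_w

def convolutional_params(n_filters, filter_h, filter_w):
    # Each filter is shared across all spatial locations
    return n_filters * filter_h * filter_w

n_pixels = 200 * 200
fc = fully_connected_params(n_pixels, 40_000)        # 1.6e9, the slide's "~2B"
lc = locally_connected_params(40_000, 10, 10)        # 4M
conv = convolutional_params(100, 10, 10)             # 10K
```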

Fully Connected Layer: 32x32x3 image -> stretch to 3072 x 1. Input: 3072 x 1; weights W: 10 x 3072; activation: 10 x 1. Each activation is 1 number: the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product). Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
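A minimal pure-Python sketch of this layer: the image is flattened to 3072 numbers and each of the 10 activations is a dot product with one row of W. The all-ones image and constant weight rows are toy values chosen so the result is easy to check by hand.

```python
def flatten(image):
    # image[row][col][channel] -> flat list of length H * W * D
    return [v for row in image for pixel in row for v in pixel]

def fully_connected(W, x):
    # One dot product per row of W
    return [sum(w * v for w, v in zip(row, x)) for row in W]

image = [[[1] * 3 for _ in range(32)] for _ in range(32)]   # 32x32x3, all ones
x = flatten(image)
W = [[k] * 3072 for k in range(10)]                          # row k filled with k
activations = fully_connected(W, x)
```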

Convolution Layer: 32x32x3 image -> preserve spatial structure (width 32, height 32, depth 3). Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Convolution Layer: 32x32x3 image, 5x5x3 filter. Filters always extend the full depth of the input volume. Convolve the filter with the image, i.e. slide it over the image spatially, computing dot products. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Convolution Layer: 32x32x3 image, 5x5x3 filter. Each output is 1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. a 5*5*3 = 75-dimensional dot product + bias). Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Convolution Layer: 32x32x3 image, 5x5x3 filter. Convolve (slide) over all spatial locations to get a 28x28x1 activation map.

Convolution Layer: consider a second, green filter. Convolving it over all spatial locations gives a second 28x28x1 activation map. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Convolution Layer: for example, if we had 6 5x5 filters, we'll get 6 separate activation maps, each 28x28. We stack these up to get a new image of size 28x28x6! Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
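The shapes on these slides can be reproduced with a naive pure-Python convolution layer (stride 1, no padding, which is what the 32 -> 28 output size implies). The function name, the all-ones image and the constant filters are toy assumptions chosen so the outputs are easy to verify: each value is a 75-dimensional dot product plus bias.

```python
def conv_layer(image, filters, biases):
    """image[r][c][d]; each filter is fh x fw x depth; returns out[r][c][k]."""
    height, width, depth = len(image), len(image[0]), len(image[0][0])
    fh, fw = len(filters[0]), len(filters[0][0])
    out_h, out_w = height - fh + 1, width - fw + 1
    maps = []
    for filt, bias in zip(filters, biases):
        # One activation map per filter: slide over all spatial locations
        amap = [[bias + sum(image[r + i][c + j][d] * filt[i][j][d]
                            for i in range(fh)
                            for j in range(fw)
                            for d in range(depth))
                 for c in range(out_w)]
                for r in range(out_h)]
        maps.append(amap)
    # Stack the maps along the last axis: out_h x out_w x n_filters
    return [[[amap[r][c] for amap in maps] for c in range(out_w)]
            for r in range(out_h)]

image = [[[1] * 3 for _ in range(32)] for _ in range(32)]                  # 32x32x3
filters = [[[[k + 1] * 3 for _ in range(5)] for _ in range(5)] for k in range(6)]
biases = [0] * 6
out = conv_layer(image, filters, biases)
```

Real frameworks vectorize this heavily; the nested loops here are only meant to make the sliding dot product explicit.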