DeepID: Deep Learning for Face Recognition. Department of Electronic Engineering,


DeepID: Deep Learning for Face Recognition
Xiaogang Wang
Department of Electronic Engineering, The Chinese University of Hong Kong

Machine Learning with Big Data
Machine learning with small data: overfitting; reducing model complexity (capacity), adding regularization.
Machine learning with big data: underfitting; increasing model complexity, optimization, computation resources.

Face Recognition
Face verification: binary classification. Do two images belong to the same person or not?
Face identification: multi-class classification. Classify an image into one of N identity classes.
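The difference between the two tasks can be sketched in a few lines. This is a minimal illustration only: the cosine similarity measure, the `threshold` value, and the `gallery` dictionary of one reference feature per identity are assumptions made for the example, not details given on the slide.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def verify(feat_a, feat_b, threshold=0.5):
    """Face verification: a binary decision, same person or not.
    The threshold is a hypothetical value chosen for illustration."""
    return cosine_similarity(feat_a, feat_b) >= threshold

def identify(feat, gallery):
    """Face identification: choose one of N identity classes.
    `gallery` maps an identity name to a reference feature vector."""
    return max(gallery, key=lambda name: cosine_similarity(feat, gallery[name]))

gallery = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}
print(verify([1.0, 0.0], [0.0, 1.0]))   # orthogonal features
print(identify([0.9, 0.1], gallery))    # closest reference identity
```

Verification returns a yes/no answer for a pair, while identification returns one label out of N, which is why the latter is the richer prediction task.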

Labeled Faces in the Wild (2007) Best results without deep learning

Learn face representations from face verification, identification, and multi-view reconstruction
Properties of face representations: sparseness, selectiveness, robustness
Sparsify the network: sparseness, selectiveness
Applications of face representations: face localization, attribute recognition


Key challenge in face recognition
Intra-personal variation versus inter-personal variation.
How can the two types of variation be separated?

Learning feature representations
Training stage A: Dataset A, feature transform, Classifier A. Task A is one of: distinguish 10,000 people (identification); decide whether two images belong to the same person or not (verification); reconstruct faces in multiple views.
Training stage B: Dataset B, fixed feature transform, linear Classifier B. Task B: decide whether two images belong to the same person or not (face verification).

Learn face representations from:
Predicting binary labels (verification)
Predicting multi-class labels (identification)
Predicting thousands of real-valued pixels (multi-view reconstruction)
Going down this list, the prediction becomes richer and more challenging, the supervision becomes stronger, and feature learning becomes more effective.

Learn face representations with the verification signal
Extract relational features with learned filter pairs.
These relational features are further processed through multiple layers to extract global features.
The fully connected layer is the feature representation.
Y. Sun, X. Wang, and X. Tang, Hybrid Deep Learning for Computing Face Similarities, Proc. ICCV, 2013.

DeepID: Learn face representations with the identification signal
Identity labels are one-hot vectors: (1, 0, 0), (0, 1, 0), (0, 0, 1).
Y. Sun, X. Wang, and X. Tang, Deep Learning Face Representation from Predicting 10,000 Classes, Proc. CVPR, 2014.

DeepID2: Joint identification (Id) and verification (Ve) signals
Y. Sun, X. Wang, and X. Tang, NIPS, 2014.
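One way to picture a joint identification and verification objective is sketched below. The slide names only the two signals; the softmax cross-entropy form for identification, the margin-based contrastive form for verification, and the weighting parameter `lam` and `margin` values are assumptions chosen because they are a common way to combine such signals, not a statement of the authors' exact formulation.

```python
import math

def identification_loss(logits, target):
    """Softmax cross-entropy, -log p(target), computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def verification_loss(f_i, f_j, same_identity, margin=1.0):
    """Contrastive loss (assumed form): pull features of the same identity
    together, push different identities at least `margin` apart."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(f_i, f_j)))
    if same_identity:
        return 0.5 * dist ** 2
    return 0.5 * max(0.0, margin - dist) ** 2

def joint_loss(logits_i, logits_j, id_i, id_j, f_i, f_j, lam=0.05, margin=1.0):
    """Identification loss on both images plus a weighted verification loss.
    `lam` is a hypothetical weighting hyperparameter."""
    ident = identification_loss(logits_i, id_i) + identification_loss(logits_j, id_j)
    verif = verification_loss(f_i, f_j, id_i == id_j, margin)
    return ident + lam * verif

# Two images of different identities whose features are already far apart:
# the verification term contributes nothing, only identification remains.
print(joint_loss([2.0, 0.0], [0.0, 2.0], 0, 1, [0.0, 0.0], [3.0, 4.0]))
```

The identification term spreads different identities apart (inter-personal variation), while the verification term shrinks distances within an identity (intra-personal variation).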

Learning face representation from recovering canonical-view face images
[Figure: reconstruction examples from LFW, with panels labeled Julie and Cindy.]
Z. Zhu, P. Luo, X. Wang, and X. Tang, Deep Learning Identity Preserving Face Space, ICCV 2013.

Disentangle factors through feature extraction over multiple layers.
No 3D model; no prior information on pose and lighting condition.
Model multiple complex transforms.
Reconstructing the whole face is a much stronger supervision than predicting a 0/1 class label.
[Figure: arbitrary view mapped to canonical view.]

It is still not a 3D representation yet Can we reconstruct all the views?

A multi-task solution: discretize the view spectrum
[Figure: input image, hidden layers up to layer n, and output images y1 (0°), y2 (45°), y3 (90°).]
1. The number of views to be reconstructed is predefined, equivalent to the number of tasks.
2. Cannot reconstruct views not present in the training set.
3. Encounters problems when the training data of different views are unbalanced.
4. Model complexity increases with the number of views.

Deep learning multi-view representation from 2D images
Given an image under an arbitrary view, its viewpoint can be estimated and its full spectrum of views can be reconstructed.
Continuous view representation.
Identity and view are represented by different sets of neurons.
[Figure: example faces labeled Jackie and Feynman.]
Z. Zhu, P. Luo, X. Wang, and X. Tang, Deep Learning and Disentangling Face Representation by Multi-View Perception, NIPS 2014.

The network is composed of deterministic neurons and random neurons.
x and y are the input and output images of the same identity but in different views;
v is the view label of the output image;
h^id are neurons encoding identity features;
h^v are neurons encoding view features;
h^r are neurons encoding features to reconstruct the output images.

Deep Learning by EM
EM updates on the probabilistic model are converted to forward and backward propagation.
E-step: propose s samples of h.
M-step: compute the gradient with respect to the sample of h with the largest weight w_s.

Face recognition accuracies across views and illuminations on the Multi PIE dataset. The first and the second best performances are in bold.

Deep Learning Multi-view Representation from 2D Images
Interpolate and predict images under viewpoints unobserved in the training set.
The training set only has viewpoints of 0°, 30°, and 60°. (a) The reconstructed images under 15° and 45° when the input is taken under 0°. (b) The input images are under 15° and 45°.

Generalize to other facial factors
[Figure: network diagram with input image x, hidden layers up to layer n, identity neurons h^id, random neurons h^v for view and age, labels of view and age, and output image y.]

Face reconstruction across poses and expressions

Face reconstruction across lightings and expressions

Learn face representations from face verification, identification, and multi-view reconstruction
Properties of face representations: sparseness, selectiveness, robustness
Sparsify the network: sparseness, selectiveness
Applications of face representations: face attribute recognition, face localization
Y. Sun, X. Wang, and X. Tang, CVPR 2015

Deeply learned features are moderately sparse
The binary codes on activation patterns are very effective for face recognition.
They save storage and speed up face search dramatically.
Activation patterns are more important than activation magnitudes in face recognition.

Model                          Joint Bayesian (%)   Hamming distance (%)
Combined model (real values)   99.47                n/a
Combined model (binary code)   99.12                97.47

Deeply learned features are moderately sparse
For an input image, about half of the neurons are activated.
Maximize the Hamming distance between images.
[Figure: example binary codes. Moderately sparse: 1 0 1 1 0 0 and 0 1 0 0 1 1, Hamming distance 6. Highly sparse: 1 0 0 0 0 0 and 0 1 0 0 0 0, Hamming distance 2.]
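The binary-code comparison on this slide can be reproduced with a short sketch. The example codes are the ones shown on the slide; the `binarize` helper and its zero threshold are assumptions about how activations might be turned into codes.

```python
def binarize(activations, threshold=0.0):
    """Turn real-valued activations into a binary activation pattern.
    Thresholding at zero is an assumption made for this sketch."""
    return [1 if a > threshold else 0 for a in activations]

def hamming_distance(code_a, code_b):
    """Number of positions at which two equal-length binary codes differ."""
    if len(code_a) != len(code_b):
        raise ValueError("codes must have the same length")
    return sum(x != y for x, y in zip(code_a, code_b))

# Codes from the slide: about half the neurons active versus one neuron active.
moderate_a, moderate_b = [1, 0, 1, 1, 0, 0], [0, 1, 0, 0, 1, 1]
sparse_a, sparse_b = [1, 0, 0, 0, 0, 0], [0, 1, 0, 0, 0, 0]

print(hamming_distance(moderate_a, moderate_b))  # 6
print(hamming_distance(sparse_a, sparse_b))      # 2
```

With six neurons, two highly sparse codes can differ in at most two positions, whereas moderately sparse codes can differ in all six, which leaves more room to separate images.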

Deeply learned features are moderately sparse
[Figure: responses of a particular neuron on all the images.]
A neuron responds to about half of the images.
This maximizes the discriminative power (entropy) of a neuron in describing the image set.
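The link between "active on about half of the images" and maximal entropy can be checked numerically. The sketch below treats a neuron as a binary on/off variable, which is a simplifying assumption; `activation_rate` and its zero threshold are hypothetical helpers for the example.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a binary variable that is 'on' with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def activation_rate(responses, threshold=0.0):
    """Fraction of images on which a neuron responds (assumed threshold at zero)."""
    return sum(r > threshold for r in responses) / len(responses)

responses = [1.2, 0.0, 0.4, 0.0]         # one neuron over four images
rate = activation_rate(responses)
print(rate, binary_entropy(rate))        # 0.5 1.0
print(binary_entropy(0.1))               # a highly sparse neuron carries less
```

A neuron that fires on half of the images carries one full bit about the image set, the maximum for a binary response.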

Deeply learned features are selective to identities and attributes
With a single neuron, DeepID2 reaches 97% recognition accuracy for some identities and attributes.

Deeply learned features are selective to identities and attributes
Excitatory and inhibitory neurons (on identities)
[Figure: histograms of neural activations over identities with the most images in LFW, one panel per neuron.]

Deeply learned features are selective to identities and attributes
Excitatory and inhibitory neurons (on attributes)
[Figure: histograms of neural activations over gender-related attributes (Male and Female), one panel per neuron.]
[Figure: histograms of neural activations over race-related attributes (White, Black, Asian, and Indian), one panel per neuron.]
[Figure: histograms of neural activations over age-related attributes (Baby, Child, Youth, Middle Aged, and Senior), one panel per neuron.]
[Figure: histograms of neural activations over hair-related attributes (Bald, Black Hair, Gray Hair, Blond Hair, and Brown Hair), one panel per neuron.]

Deeply learned features are selective to identities and attributes
With a single neuron, DeepID2 reaches 97% recognition accuracy for some identities and attributes.
[Figure: identity classification accuracy on LFW with one single DeepID2+ or LBP feature. GB, CP, TB, DR, and GS are five celebrities with the most images in LFW.]
[Figure: attribute classification accuracy on LFW with one single DeepID2+ or LBP feature.]
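A single-feature classifier of the kind measured here can be sketched as a threshold on one neuron's activation. How the thresholds were actually chosen in the experiments is not stated on the slide; the exhaustive threshold search and the handling of both polarities below are assumptions made for illustration.

```python
def best_threshold_accuracy(values, labels):
    """Best accuracy of a one-feature threshold classifier.

    values: activations of a single neuron, one per image.
    labels: 1 if the image has the identity or attribute, else 0.
    Both polarities are tried, so an excitatory neuron (high activation
    means positive) and an inhibitory neuron (low activation means
    positive) are scored the same way.
    """
    n = len(values)
    best = 0.0
    for t in sorted(set(values)):
        correct = sum((v >= t) == bool(y) for v, y in zip(values, labels))
        best = max(best, correct / n, (n - correct) / n)
    return best

excitatory = best_threshold_accuracy([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
inhibitory = best_threshold_accuracy([0.1, 0.2, 0.8, 0.9], [1, 1, 0, 0])
print(excitatory, inhibitory)  # 1.0 1.0
```

A neuron that is selective in either direction separates the classes with one threshold, while an unselective feature stays near chance.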

Excitatory and Inhibitory neurons DeepID2+ High dim LBP


Deeply learned features are selective to identities and attributes Visualize the semantic meaning of each neuron

Deeply learned features are selective to identities and attributes Visualize the semantic meaning of each neuron Neurons are ranked by their responses in descending order with respect to test images

Deeply learned features are robust to occlusions Global features are more robust to occlusions

Learn face representations from face verification, identification, and multi-view reconstruction
Properties of face representations: sparseness, selectiveness, robustness
Sparsify the network according to neural selectiveness: sparseness, selectiveness
Applications of face representations: face localization, attribute recognition

[Figure: network with outputs Attribute 1 through Attribute K.]
Yi Sun, Xiaogang Wang, and Xiaoou Tang, Sparsifying Neural Network Connections for Face Recognition, arXiv:1512.01891, 2015.

[Figure: network with outputs Attribute 1 through Attribute K.]
Explore correlations between neurons in different layers.


Alternately learning weights and net structures
1. Train a dense network from scratch.
2. Sparsify the top layer, and retrain the net.
3. Sparsify the second top layer, and retrain the net.
Conel, JL. The postnatal development of the human cerebral cortex. Cambridge, Mass: Harvard University Press, 1959.
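The alternating schedule above can be sketched as follows. This is a rough illustration under stated assumptions: `train` and `sparsify_layer` are hypothetical callbacks, the schedule is assumed to continue layer by layer down the stack, and the rule for choosing which connections to keep (the `keep` lower-layer neurons most correlated with each upper-layer neuron) is only one plausible reading of "explore correlations between neurons in different layers".

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length activation sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def connection_mask(lower_acts, upper_acts, keep):
    """mask[j][i] is 1 if the connection from lower neuron i to upper neuron j
    is kept. Each argument is a list of per-neuron activation lists over the
    same samples. Selection by absolute correlation is an assumed criterion."""
    mask = []
    for up in upper_acts:
        scores = [abs(pearson(low, up)) for low in lower_acts]
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        kept = set(order[:keep])
        mask.append([1 if i in kept else 0 for i in range(len(scores))])
    return mask

def alternate_sparsify(num_layers, train, sparsify_layer):
    """Train dense, then sparsify one layer at a time from the top, retraining
    after each step. `train` and `sparsify_layer` are hypothetical callbacks."""
    train()
    for layer in range(num_layers - 1, -1, -1):
        sparsify_layer(layer)
        train()

lower = [[1, 2, 3, 4], [4, 3, 2, 1], [1, -1, 1, -1]]
upper = [[2, 4, 6, 8]]
print(connection_mask(lower, upper, keep=2))  # [[1, 1, 0]]
```

Retraining after each sparsification step lets the remaining weights adapt before the next layer is thinned.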

Original deep neural network: 98.95%
Sparsified deep neural network, keeping only 1/8 of the parameters after joint optimization of weights and structures: 99.3%
Sparsified network trained from scratch: 98.33%
The sparsified network has enough learning capacity, but the original denser network helps it reach a better initialization.

Learn face representations from face verification, identification, and multi-view reconstruction
Properties of face representations: sparseness, selectiveness, robustness
Sparsify the network according to neural selectiveness: sparseness, selectiveness
Applications of face representations: face localization, attribute recognition

DeepID2 features for attribute recognition
DeepID2 features can be directly used for attribute recognition.
Use DeepID2 features as initialization (pre-trained result), and then fine-tune on attribute recognition.
Multi-task learning of face recognition and attribute prediction does not improve performance, because face recognition is a much stronger supervision than attribute prediction.

Average accuracy on 40 attributes on the CelebA and LFWA datasets:

Method                                       CelebA   LFWA
FaceTracer [1] (HOG+SVM)                     81       74
Training CNN from scratch with attributes    83       79
Directly use DeepID2 features                84       82
DeepID2 + fine-tuning                        87       84

Can features learned from face recognition improve face localization?
Single face detector: hard to handle large variety, especially in views.
Multi-view detector (View 1 through View N): view labels are given in training; each detector handles one view.
Push the idea to the extreme? Go from viewpoints to attributes: gender, expression, race, hair style.
Neurons have selectiveness on attributes.
A filter (or a group of filters) functions as a detector of a face attribute.
When a subset of neurons is activated, they indicate the existence of faces with an attribute configuration.

[Figure: attribute configuration 1 and attribute configuration 2, with labels brown hair, male, big eyes, black hair, smiling, sunglasses.]
The neurons at different layers can form many activation patterns, implying that the whole set of face images can be divided into many subsets based on attribute configurations.

LNet localizes faces
LNet is pre-trained with face recognition and fine-tuned with attribute prediction.
By simply averaging response maps, good face localization is achieved.
Z. Liu, P. Luo, X. Wang, and X. Tang, Deep Learning Face Attributes in the Wild, ICCV 2015.
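The averaging step can be illustrated with a toy sketch. Only the averaging of response maps comes from the slide; turning the averaged map into a box by thresholding, and the `threshold` value itself, are assumptions added so the example produces a location.

```python
def average_response_map(response_maps):
    """Element-wise mean of several 2D response maps of identical shape."""
    n = len(response_maps)
    rows, cols = len(response_maps[0]), len(response_maps[0][0])
    return [[sum(m[r][c] for m in response_maps) / n for c in range(cols)]
            for r in range(rows)]

def bounding_box(avg_map, threshold):
    """Smallest box (top, left, bottom, right), inclusive, covering every cell
    above the threshold. Returns None if no cell exceeds it.
    This post-processing step is an assumption, not taken from the slide."""
    hits = [(r, c) for r, row in enumerate(avg_map)
            for c, v in enumerate(row) if v > threshold]
    if not hits:
        return None
    rs = [r for r, _ in hits]
    cs = [c for _, c in hits]
    return (min(rs), min(cs), max(rs), max(cs))

maps = [
    [[0, 0, 0], [0, 1, 1], [0, 1, 1]],
    [[0, 0, 0], [0, 1, 0], [0, 0, 1]],
]
avg = average_response_map(maps)
print(bounding_box(avg, threshold=0.4))  # (1, 1, 2, 2)
```

Because individual filters respond to different attribute configurations, their average tends to be high wherever any kind of face is present.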

[Figure: (a) ROC curves of LNet and state-of-the-art face detectors; (b) recall rates w.r.t. number of attributes (FPPI = 0.1).]

Attribute selectiveness: neurons serve as detectors Identity selectiveness: neurons serve as trackers L. Wang, W. Ouyang, X. Wang, and H. Lu, Visual Tracking with Fully Convolutional Networks, ICCV 2015.

Conclusions
Face representations can be learned from the tasks of verification, identification, and multi-view reconstruction.
Deeply learned features are moderately sparse, identity and attribute selective, and robust to data corruption.
The net can be sparsified substantially by alternately optimizing the weights and structures.
Because of these properties, the learned face representations are effective for applications beyond face recognition, such as face localization and attribute prediction.

Collaborators Yi Sun Ziwei Liu Zhenyao Zhu Ping Luo Xiaoou Tang

Thank you! http://mmlab.ie.cuhk.edu.hk/ http://www.ee.cuhk.edu.hk/~xgwang/