StatPatternRecognition: Status and Plans. Ilya Narsky, Caltech

Outline
- Package distribution and management
- Implemented classifiers and other tools
- User interface
- Near-future plans and solicitation

This is a technical talk on capabilities of the package. Not intended to explain how classifiers work or what classifiers you should be using in what situations. For more info on theory, please refer to:
  http://www.hep.caltech.edu/~narsky/spr.html
  http://www-group.slac.stanford.edu/sluo/lectures/stat2006_lectures.html

BaBar Meeting, February 2007

Distribution
- Between summer 2005 and summer 2006, SPR was distributed outside BaBar as a standalone package with a mandatory dependency on CLHEP and an optional dependency on Root.
- In summer 2006 SPR was stripped of the CLHEP dependency and equipped with autotools (thanks to Andy Buckley and Xuan Luo).
- In September 2006 SPR was posted at Sourceforge: https://sourceforge.net/projects/statpatrec
- Since then, Sourceforge has been the main source of SPR code. SPR versions elsewhere (including the one in BaBar CVS) are not up to date.
- The plan is to write scripts for converting the Sourceforge version to the BaBar and CMS environments.

History at Sourceforge
- Since September 2006 several new methods and interface utilities have been implemented. 9 tags released. See further slides.
- Downloaded ~220 times.
- 26 subscribers on the Sourceforge mailing list.
- My coworker submitted SPR to the Fedora package repository. It will be part of the OS soon.

User builds
- SPR can be built in two versions:
  - standalone with Ascii I/O; no external dependencies
  - Root I/O
- Desired build options are chosen by the ./configure script. See INSTALL. Run ./configure --help for a full list of options.
- The Ascii version of the package builds smoothly.
- The Root version build can bail out with errors related to Root classes. The fix is to change compilation flags. Again, see INSTALL for details.
- Successful builds on 32-bit SL3, SL4, RH4 and 64-bit Fedora.
- Enthusiasts adapted SPR to WinXP and MacOS.

Implemented classifiers
- Decision split, or stump
- Decision trees (2 flavors)
- Bump hunter (PRIM, Friedman & Fisher)
- LDA (aka Fisher) and QDA
- Logistic regression
- Boosting: discrete AdaBoost, real AdaBoost, and epsilon-boost
- Arc-x4 (a variant of boosting from Breiman)
- Bagging
- Random forest
- Backprop neural net with a logistic activation function (original implementation)
- Multi-class learner (Allwein, Schapire and Singer)
- Interfaces to SNNS neural nets (without training): Backprop neural net, and Radial Basis Functions

A tiny bit of theory
- Logistic regression
  - Like Fisher, computes an optimal linear boundary between two classes.
  - Unlike Fisher, does not assume that the two classes have multivariate Gaussian distributions. Computes the boundary under more general assumptions.
- Boosting:
  - Discrete AdaBoost computes the misclassification rate for each weak classifier and increases weights of misclassified events by an amount derived from this misclassification rate.
  - epsilon-boost increases weights of misclassified events by a fixed amount specified by the user.
  - Real AdaBoost multiplies weights of signal events by (1-r)/r and weights of background events by r/(1-r), where r is the weak classifier response for this event.
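The three weight updates described above can be sketched in Python. This is an illustration only, not SPR code: the function names are hypothetical, and two details the slide leaves open are filled in with assumptions that are labeled in the comments. Discrete AdaBoost is assumed to multiply misclassified weights by (1-err)/err, and epsilon-boost by exp(2*eps). The Real AdaBoost factors are the ones stated on the slide; clipping r away from 0 and 1 is an added safeguard.

```python
import math


def normalize(weights):
    """Rescale weights so that they sum to one."""
    total = sum(weights)
    return [w / total for w in weights]


def discrete_adaboost_update(weights, labels, predictions):
    """Labels and predictions are 1 (signal) or 0 (background).

    Assumption: misclassified weights are multiplied by (1-err)/err,
    where err is the weighted misclassification rate.
    """
    err = sum(w for w, y, p in zip(weights, labels, predictions) if y != p)
    err /= sum(weights)
    if not 0.0 < err < 0.5:
        raise ValueError("weak classifier needs 0 < error < 0.5")
    factor = (1.0 - err) / err
    return normalize([w * factor if y != p else w
                      for w, y, p in zip(weights, labels, predictions)])


def epsilon_boost_update(weights, labels, predictions, eps=0.01):
    """Assumption: the fixed user-specified amount is a factor exp(2*eps)."""
    factor = math.exp(2.0 * eps)
    return normalize([w * factor if y != p else w
                      for w, y, p in zip(weights, labels, predictions)])


def real_adaboost_update(weights, labels, responses, clip=1e-6):
    """Responses r lie in (0, 1); factors follow the slide.

    Signal weights are multiplied by (1-r)/r, background weights by r/(1-r).
    Clipping r is an assumption added to avoid division by zero.
    """
    updated = []
    for w, y, r in zip(weights, labels, responses):
        r = min(max(r, clip), 1.0 - clip)
        updated.append(w * ((1.0 - r) / r if y == 1 else r / (1.0 - r)))
    return normalize(updated)
```

With the Discrete AdaBoost factor assumed here, the misclassified events carry exactly half of the total weight after the update, which is what forces the next weak classifier to concentrate on them.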

Other tools (reminder)
- Cross-validation
- Bootstrap
- Tools for variable selection
- Decision trees and bump hunter can optimize any of 10 figures of merit implemented in the package
- Computation of data moments (mean, variance, covariance, kurtosis etc.)
- Arbitrary grouping of input classes in two categories (signal and background)
- Multivariate GoF method proposed by Friedman at Phystat 2003
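For readers unfamiliar with the first two tools in this list, the resampling behind cross-validation and the bootstrap can be sketched as follows. This is a generic illustration in Python, not the SPR implementation; the function names and the seed handling are assumptions.

```python
import random


def kfold_indices(n_events, n_folds, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Every event appears in exactly one test fold.
    """
    indices = list(range(n_events))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(folds):
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, test


def bootstrap_indices(n_events, seed=0):
    """Draw n_events indices with replacement (one bootstrap replica)."""
    rng = random.Random(seed)
    return [rng.randrange(n_events) for _ in range(n_events)]
```

A classifier would be trained on each `train` list and evaluated on the matching `test` list. A bootstrap replica is the same kind of resampling that bagging uses to build each training sample.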

Boost and bag anything you like
- SPR was always capable of boosting and bagging any classifier in the package, if you were willing to do a bit of C++ coding.
- Now you can boost and bag an arbitrary sequence of classifiers using the SprBoosterApp and SprBaggerApp executables. All you need to do is specify the classifier name and parameters in the input config file. See booster.config for examples.
- Example: boost a decision tree (odd cycles) and a neural net (even cycles)

    cat booster.config
      TopdownTree 5 0 8000
      StdBackprop 30:15:7:1 100 0.1 0 0.1

    SprBoosterApp -M 2 -n 100 -g 2 -t test.pat -d 1 train.pat booster.config

  (run 100 training cycles with Real AdaBoost and display exponential loss on test data)
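The idea of boosting an alternating sequence of classifiers can be sketched in Python. This is a toy illustration under stated assumptions, not what SprBoosterApp does internally: the two weak learners are decision stumps on different variables standing in for the tree and the neural net, the update rule is Discrete AdaBoost rather than the Real AdaBoost selected in the command above, and all function names are hypothetical.

```python
import math


def make_stump_trainer(feature):
    """Return a trainer that fits a weighted decision stump on one variable."""
    def train(X, y, w):
        best = None
        for threshold in sorted({row[feature] for row in X}):
            for sign in (1, -1):
                # Weighted error of the rule "x >= threshold -> sign".
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if (sign if row[feature] >= threshold else -sign) != yi)
                if best is None or err < best[0]:
                    best = (err, threshold, sign)
        _, best_threshold, best_sign = best
        return lambda row: best_sign if row[feature] >= best_threshold else -best_sign
    return train


def boost_alternating(X, y, trainers, n_cycles):
    """Discrete AdaBoost with labels +1/-1, cycling through `trainers`."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for cycle in range(n_cycles):
        train = trainers[cycle % len(trainers)]  # alternate weak learners
        h = train(X, y, w)
        err = sum(wi for row, yi, wi in zip(X, y, w) if h(row) != yi)
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, h))
        # Raise weights of misclassified events, lower the rest, renormalize.
        w = [wi * math.exp(-alpha * yi * h(row)) for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return lambda row: 1 if sum(a * h(row) for a, h in ensemble) >= 0 else -1


# Toy data: signal (+1) only when both variables are 1.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [-1, -1, -1, 1]
classifier = boost_alternating(
    X, y, [make_stump_trainer(0), make_stump_trainer(1)], n_cycles=3)
```

Neither stump alone separates this toy sample, but the weighted vote over the alternating sequence does.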

What should you boost and bag?
- Boosted and bagged (plus random forest) decision trees are not news anymore. Enough examples in HEP analysis.
- Boosted random forests
  - Byron Roe et al., physics/0508045, PID at MiniBooNE. Performance same as that of optimal boosted decision trees.
- Boosted neural nets
  - Statistics literature: plenty, e.g., see next slide
  - Physics literature: Meiling, Mingmei and Lianshou, hep-ph/0606257, classification of quark and gluon jets from e+e- collisions at 91 GeV
- Bagged neural nets
  - None in physics; examples in stats literature
- Alternated boosting of decision tree and neural net
  - None to my knowledge. Seems like an obvious thing to try.

Boosted and bagged neural nets
- H. Drucker, "Boosting Using Neural Networks", in Combining Artificial Neural Nets, ed. A. Sharkey, Springer Series in Perspectives in Neural Computing
- Compares DT, boosted DT, NN, bagged NN and boosted NN.
- Boosted NN gives minimal classification error for all examples.
- 100-D data with 120k training events

Multi-class learner
- Method by Allwein, Schapire and Singer. Reduces any multi-class problem to a set of binary problems according to the user-specified class matrix. Popular strategies are one-vs-one and one-vs-all.
- Was implemented in SPR a long time ago, but there used to be only one executable for multi-class learning, with boosted binary splits. This executable has been replaced with SprMultiClassApp, capable of using any classifier.
- Again, the user needs to choose the binary classifier in an input config file using syntax identical to booster.config.
- A beer pack for the first serious user of this algorithm is still waiting.
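The class matrix for the two popular strategies, and one way to turn binary outputs back into a class label, can be sketched in Python. This is a minimal illustration, not the SprMultiClassApp implementation: the function names are hypothetical, the matrix convention (rows are classes, columns are binary problems, +1 for signal, -1 for background, 0 for unused) is assumed, and Hamming decoding is only one of the decoding options discussed by Allwein, Schapire and Singer.

```python
from itertools import combinations


def one_vs_all_matrix(n_classes):
    """One binary problem per class: that class against all the others."""
    return [[1 if row == col else -1 for col in range(n_classes)]
            for row in range(n_classes)]


def one_vs_one_matrix(n_classes):
    """One binary problem per pair of classes; other classes are unused (0)."""
    pairs = list(combinations(range(n_classes), 2))
    return [[1 if row == a else -1 if row == b else 0 for a, b in pairs]
            for row in range(n_classes)]


def hamming_decode(matrix, binary_outputs):
    """Pick the class whose matrix row is closest to the binary outputs.

    binary_outputs[s] is +1 or -1 from binary classifier s. A matrix
    entry of 0 contributes 1/2 to the distance regardless of the output.
    """
    distances = [sum((1 - m * f) / 2 for m, f in zip(row, binary_outputs))
                 for row in matrix]
    return distances.index(min(distances))
```

For three classes, `one_vs_all_matrix(3)` has three columns and `one_vs_one_matrix(3)` has one column for each of the pairs (0,1), (0,2) and (1,2).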

User tools (recent additions)
- SprInteractiveAnalysisApp
  - Interactive selection of classifiers and comparison of their performance on test data
  - Should be used only on small datasets in not too many dimensions
- SprOutputWriterApp
  - For reading stored classifier configurations and applying them to test data.
  - Can handle any classifier and can read several classifiers at once. See example on next slide.
  - In tags prior to V05-00-00 each classifier had an individual configuration reader, e.g., SprAdaBoostDecisionTreeReader, SprBaggerDecisionTreeReader etc. In tag V05-00-00 all readers were replaced by a single class, SprClassifierReader.
  - This change is backwards-incompatible. You won't be able to read classifier configuration files produced by earlier tags.
- SprOutputAnalyzerApp
  - For people who don't like Root. Prints out efficiency curves for data and classifier responses stored in Ascii format.

Example: Analysis of Big Datasets

    SprBoosterApp -n 300 -f booster.spr train.pat booster.config
  (train 300 cycles of Discrete AdaBoost and save the classifier configuration into booster.spr)

    SprBaggerApp -n 100 -f bagger.spr train.pat bagger.config
  (train 100 cycles of bagger and save the configuration into bagger.spr)

    SprOutputWriterApp -C boost,bag booster.spr,bagger.spr test.pat test.root
  (read classifier configurations from booster.spr and bagger.spr and apply them to test data; save input variables and classifier output into test.root; classifier responses will be saved with names boost and bag)

This will work for any classifiers specified by the user in booster.config and bagger.config, due to the unified interface implemented in SprClassifierReader.

Plans
- Write a set of scripts to adapt the Sourceforge version to the BaBar and CMS environments. Someone at Caltech promised to do this for CMS. The BaBar part will be my burden unless... Volunteers?
- Build a web service using the Clarens framework. Clarens has been developed at Caltech and used for the most part at CMS. Use a web client to submit jobs to remote published resources.
- If you are interested in contributing, I have a laundry list of methods that would be good to implement.
- User feedback is appreciated. Not necessarily bug reports; mere statements such as "we applied SPR to our analysis and obtained such results" would be helpful.