STAT 503 Case Study: Supervised classification of music clips


1 Data Description

This data was collected by Dr Cook from her own CDs. Using a Mac, she read each track into the music editing software Amadeus II, then snipped and saved the first 40 seconds as a WAV file. (WAV is an audio format developed by Microsoft, commonly used on Windows, though it is becoming less popular.) These files were read into R using the package tuneR, which converts the audio file into numeric data. All of the CDs contained left and right channels, and variables were calculated on both channels. The resulting data has 57 rows (cases) and 72 columns (variables).

LVar, LAve, LMax, RVar, RAve, RMax: variance, average, and maximum of the frequencies of the left and right channels, respectively.
LPer1-LPer15, LFreq1-LFreq15, RPer1-RPer15, RFreq1-RFreq15: height and frequency of the 15 highest peaks in the periodogram.
LFEner, RFEner: an indicator of the amplitude or loudness of the sound.
LFVar, RFVar: variance in the frequencies as computed by the periodogram function.

There are 30 tracks by Abba, the Beatles and the Eels, which would be considered Rock, and 24 tracks by Vivaldi, Mozart and Beethoven, considered Classical. The main question we want to answer is: Can Rock tracks be distinguished from Classical tracks using the given variables? Other questions of interest are: How does Enya compare to Rock and Classical tracks? Are there differences from CD to CD? Are there differences between the tracks of different artists? Is any difference between Rock and Classical due to voice vs no voice?
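As a rough illustration of the per-channel summaries described above, the Var, Ave and Max features for one channel are simple statistics of the decoded samples. This is a Python sketch with toy numbers, not the tuneR/R computation used in the report:

```python
# Sketch (not the original tuneR/R code): computing per-channel summary
# features like LVar, LAve, LMax from a vector of sample values.
def channel_features(samples):
    n = len(samples)
    ave = sum(samples) / n
    # sample variance, dividing by n - 1
    var = sum((s - ave) ** 2 for s in samples) / (n - 1)
    return {"Var": var, "Ave": ave, "Max": max(samples)}

# Toy stand-in for one decoded audio channel.
left = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0]
feats = channel_features(left)
```

The same function would be applied to the right channel to produce the R-prefixed variables.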

2 Suggested approaches

Data restructuring: Summarize and possibly impute missing values. Divide the data into training and test sets. Select the most important variables.

Summary statistics: Tabulate averages and standard deviations for the important variables, for each group. (Is there a difference in average left channel variance in frequency for Rock and Classical tracks?)

Plots: Univariate plots and scatterplots of important variables. (Are there differences between Rock and Classical tracks?)

Numerical classifiers: LDA, QDA, logistic regression, trees and random forests. (How do we predict a new track to be either Rock or Classical?)

3 Actual Results

3.1 Data restructuring

3.1.1 Missing Values

Number of missings   Number of variables
0                    42
1                    18
2                    9
3                    2

Table 1: Number of missings by variable.

Table 1 contains a tabulation of the number of missings on each variable. The response variable, Type, has no missing values, and most of the predictor variables have none either. The missing values are concentrated in the Freq variables. Most of these have 1 missing value (LFreq1-4, LFreq6-8, LFreq10-15, RFreq1-4, RFreq13), some have 2 (LFreq9, RFreq5-10, RFreq14-15), and a couple have 3 (RFreq11-12). LFreq5 has no missing values. The tracks with missing values are:

Track              Num Missing   Variable(s)
The Winner         1             LFreq
Cant.Buy.Me.Love   3             RFreq10-12
I.Feel.Fine        9             RFreq5-9, RFreq11-12, RFreq14-15
Beethoven 2        29            LFreq1-LFreq4, LFreq6-LFreq15, RFreq1-15

Perhaps imputing the values for Beethoven 2 will be enough, and the other missings can be eliminated by choosing variables with no missings. Beethoven 2 is most similar to Vivaldi's tracks 2, 4 and 8 on the variables LVar, LAve, LMax, RVar, RAve, RMax, LFEner, RFEner, LFVar and RFVar, but on its one non-missing Freq value, LFreq5, it is very different from these tracks. It might not be so easy to impute the missings for this track. We'll use random forests to help decide which variables are important, and then decide what to do with the missing values. Below are the top 10 variables according to MeanDecreaseAccuracy, and according to MeanDecreaseGini:

Variable   MeanDecAcc   MeanDecGini
LVar       1.47         0.81
RAve       1.28         0.57
RFreq14    1.23         0.63
RMax       1.16         0.65
LFEner     1.15         0.55
LFreq7     1.12         0.56
RFreq13    1.09         0.83
LFVar      1.08         0.54
LFreq13    1.02         0.49
LFreq5     0.94         0.49

Variable   MeanDecAcc   MeanDecGini
RFreq13    1.09         0.83
LVar       1.47         0.81
LFreq12    0.81         0.76
RFreq12    0.92         0.73
LFreq2     0.87         0.66
RMax       1.16         0.65
RFreq14    1.23         0.63
RAve       1.28         0.57
LFreq7     1.12         0.56
LFEner     1.15         0.55

The most important variable is LVar, which is at the top of both lists. Other important variables appear to be RAve, RMax, RFreq13, LFEner, RFreq14 and LFreq7. This suggests we should consider using LVar, RVar, LAve, RAve, LMax, RMax, LFEner, RFEner, and several LFreq/RFreq variables (7, 13, 14). It seems a bit strange to take the 7th, 13th and 14th highest peaks in the periodogram, and it may be easier to explain the results if these variables are not used at all, so we'll compare classifications with and without them.

Out of interest, we examine the Freq variables more closely. We need to see whether a track's Freq values mostly cluster around a similar value; if so, these variables might be summarized by an average value. Below are parallel coordinate plots of the LFreq and RFreq variables. In the LFreq variables the peaks of Rock tracks are mostly at lower frequencies and the peaks of the Classical tracks are mostly at higher frequencies. For the most part these tracks have similar frequencies for the peaks, seen in the mostly parallel lines. A few tracks have large differences in the frequencies of peaks: V1 and Dancing Queen. Similar observations can be made about the RFreq variables, although there are more tracks with varied frequencies: V1, Dancing Queen, SOS, I Want to Hold Your Hand, Can't Buy Me Love. It looks like taking an average of these LFreq and RFreq variables may be a reasonable way to reduce the number of variables and remove missing values.
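The collapsing step just described can be sketched in a few lines of Python (the report's appendix does this in R with apply(..., median, na.rm=T)); here missing values are represented as None and skipped:

```python
# Sketch of collapsing the 15 per-track Freq measurements into one value,
# skipping missing entries (the R code uses the median with na.rm=TRUE).
import statistics

def collapse_freqs(freqs):
    present = [f for f in freqs if f is not None]
    if not present:  # all values missing (e.g. Beethoven 2's RFreq variables)
        return None
    return statistics.median(present)

# Toy per-track LFreq measurements with one missing value.
lfreq = collapse_freqs([120.0, None, 130.0, 125.0])
```

A track with all 15 values missing still comes out as missing, which is why Beethoven 2's RFreq needs separate handling.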

There is one missing value left after doing this: B2 on RFreq. For this value we substitute the LFreq value. This leaves us with 10 variables to use for the classification: LVar, RVar, LAve, RAve, LMax, RMax, LFEner, RFEner, LFreq, RFreq.

3.2 Summary Statistics

There are 30 Rock tracks (10 Abba, 10 Beatles, 10 Eels) and 24 Classical tracks (10 Vivaldi, 6 Mozart, 8 Beethoven). Table 2 contains the means and standard deviations of the important variables, broken out by type of music and by artist.

Type       LVar    RVar    LAve    RAve    LMax    RMax    LFEner  RFEner  LFreq   RFreq
Rock       35.3    30.7    -27.2   -1.75   27562   27223   107     105     152     177
           (31)    (27)    (39.9)  (9.48)  (5929)  (5882)  (3.95)  (3.84)  (93.9)  (155)
Classical  5.11    5.21    17.8    17.9    18241   16801   101     102     369     311
           (5.7)   (5.8)   (48.3)  (53.7)  (8554)  (7890)  (4.42)  (3.86)  (195)   (130)
Abba       6.88    7.37    -80.2   -4.36   22839   23242   103     103     134     143
Beatles    48.0    40.6    -5.97   -5.99   28545   27931   110     110     140     208
Eels       51.1    51.0    4.59    5.10    31301   30496   108     107     181     179
Beethoven  7.61    7.58    -0.74   -0.07   21120   19632   101     101     350     286
Mozart     4.69    5.61    -5.94   -1.48   18875   18312   101     102     396     353
Vivaldi    3.35    3.07    46.9    51.9    15557   13709   102     102     367     306
Enya       50.3    63.2    -11.8   -56.3   16063   15921   103     104     95      88

Table 2: Means (standard deviations) of the variables by type of music and artist. (* Raised by 10^6.)

3.3 Plots

The plots below show histograms of the selected variables. The variables with the biggest differences between Rock and Classical are LVar and RVar. LAve is only useful for distinguishing Abba tracks from the rest. LMax and RMax differ in distribution between the two classes: Rock tracks are more right-skewed, while Classical tracks are more uniformly distributed. LFEner and RFEner are surprising: although there appeared to be little difference between the means (Table 2), the Rock tracks take noticeably larger values than the Classical tracks. In LFreq and RFreq the Rock tracks are more left-skewed than the Classical tracks. It looks like further reducing the variables by half, by considering only the left channel variables, might be reasonable.

The scatterplot matrix below shows the left channel variables. Rock tracks are labeled with +, and Classical tracks with o. The relationships between the variables are important: a combination of LFEner and LFreq almost perfectly separates the two classes.

4 Classification

The data is broken into 2/3 training and 1/3 test sets by stratified sampling on artist. There are 10 tracks from each of the Rock CDs, so 7 tracks from each are randomly sampled into the training set. There are 10 tracks from Vivaldi, 6 from Mozart and 8 from Beethoven, which are sampled at 7/10, 4/6 and 6/8, respectively, into the training set. The tracks in the training set are 1, 2, 4, 6, 7, 8, 10, 11, 13, 14, 15, 17, 19, 20, 22, 23, 24, 25, 27, 28, 30, 32, 33, 34, 35, 37, 38, 41, 42, 43, 44, 46, 47, 48, 49, 51, 53, 54. The tracks in the test set are 3, 5, 9, 12, 16, 18, 21, 26, 29, 31, 36, 39, 40, 45, 50, 52.

Which classifier should we use? The variance differences between the groups in LVar and LAve suggest that LDA might not work well. The separations appear to lie in combinations of variables, which suggests trees may not work well either. Trees are simple, so we'll start with them. The results are summarized in Figure 1. The tree fits the training data very well, although the second split is too close to the Classical tracks. This is a curious choice of splits: why didn't the algorithm choose a split at LAve = -40? The misclassification tables are:

Training
True/Pred   Classical   Rock   Error
Classical   14          3      0.176
Rock        1           20     0.048
Overall                        0.105

Test
True/Pred   Classical   Rock   Error
Classical   3           4      0.57
Rock        0           9      0.00
Overall                        0.250

Random forests do a little better with this data. The training error is 4/38 = 0.105, and the test error is 2/16 = 0.125.
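The stratified split described at the start of this section can be sketched in Python (an illustration with a hypothetical track numbering for the three classical CDs, not the report's R sampling code): draw a fixed number of tracks per artist into the training set and assign the rest to the test set.

```python
import random

def stratified_split(tracks_by_artist, n_train, seed=0):
    """tracks_by_artist: {artist: [track indices]};
    n_train: {artist: number of tracks to place in the training set}."""
    rng = random.Random(seed)
    train, test = [], []
    for artist, tracks in tracks_by_artist.items():
        chosen = rng.sample(tracks, n_train[artist])  # without replacement
        train += chosen
        test += [t for t in tracks if t not in chosen]
    return sorted(train), sorted(test)

# Hypothetical numbering: 10 Vivaldi, 6 Mozart, 8 Beethoven tracks,
# sampled at 7/10, 4/6, 6/8 as in the report.
groups = {"Vivaldi": list(range(1, 11)),
          "Mozart": list(range(11, 17)),
          "Beethoven": list(range(17, 25))}
train, test = stratified_split(groups,
                               {"Vivaldi": 7, "Mozart": 4, "Beethoven": 6})
```

Stratifying by artist keeps each CD represented in both sets, so a classifier cannot succeed merely by memorizing one recording's sound.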

Figure 1: Summary of the tree classifier.

The random forest misclassification tables are:

Training
True/Pred   Classical   Rock   Error
Classical   14          3      0.176
Rock        1           20     0.048
Overall                        0.105

Test
True/Pred   Classical   Rock   Error
Classical   5           2      0.286
Rock        0           9      0.00
Overall                        0.125

Linear discriminant analysis does extremely well with this data: there are 3 errors in the training data and 0 errors in the test data. The two Rock tracks that are misclassified are both Eels tracks, Restraining and Agony; the Classical track that is misclassified is the 8th Beethoven track. Tracks that are close to the boundary are Beethoven 7, Vivaldi 9, Mozart 3, The Winner (Abba) and Yesterday (Beatles) in the training data, and Beethoven 4, Vivaldi 8, The Good Old Days (Eels) and Eleanor Rigby (Beatles) in the test data.

Training
True/Pred   Classical   Rock   Error
Classical   16          1      0.059
Rock        2           19     0.095
Overall                        0.079

Test
True/Pred   Classical   Rock   Error
Classical   7           0      0.00
Rock        0           9      0.00
Overall                        0.00

LDA uses all 5 variables to build its rule. Here are the correlations between the predicted values and each variable:

LVar    LAve    LMax    LFEner   LFreq
0.63    -0.55   0.60    0.76     -0.70

The LDA classification rule is: assign a new observation x_o to Rock if a'x_o - 15.9 > 0, and to Classical otherwise, where

a = (1.87e-08, 2.83e-02, 3.94e-05, 1.47e-01, 2.99e-03)

We decided not to fit a quadratic discriminant analysis model because LDA does very well and there is limited data for a more complex model.
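How such a linear rule is applied, and how the error rates in the misclassification tables are computed, can be illustrated with a short Python sketch. The coefficient vector used below is a toy placeholder, not the fitted LDA values:

```python
def lda_classify(x, a, cutoff=15.9):
    """Linear rule of the report's form: Rock if a'x - cutoff > 0."""
    score = sum(ai * xi for ai, xi in zip(a, x)) - cutoff
    return "Rock" if score > 0 else "Classical"

def error_rates(table):
    """table[true][pred] holds counts; returns per-class and overall error."""
    per_class = {t: sum(c for p, c in preds.items() if p != t)
                    / sum(preds.values())
                 for t, preds in table.items()}
    wrong = sum(c for t, preds in table.items()
                for p, c in preds.items() if p != t)
    total = sum(c for preds in table.values() for c in preds.values())
    return per_class, wrong / total

# The tree classifier's training table from the report: 3 Classical and
# 1 Rock track misclassified out of 38.
train_tab = {"Classical": {"Classical": 14, "Rock": 3},
             "Rock": {"Classical": 1, "Rock": 20}}
per_class, overall = error_rates(train_tab)
```

With the report's counts this reproduces the per-class errors 3/17 = 0.176 and 1/21 = 0.048, and the overall error 4/38 = 0.105.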

5 What could I do better?

It would be important to go over this report, trim it down, and re-do the plots to presentation quality.

6 Conclusions

The final classifier is the one computed using LDA: assign a new observation x_o to Rock if a'x_o - 15.9 > 0, and to Classical otherwise, where

a = (1.87e-08, 2.83e-02, 3.94e-05, 1.47e-01, 2.99e-03)

The training error is 8% and the test error is 0. The Eels tracks Restraining and Agony and Beethoven's 8th track are misclassified. The most important variables for the classification are LFEner, LFreq, LVar, LMax and LAve. Rock songs generally have higher LFEner and lower LFreq than Classical tracks.

There are numerous other interesting aspects of the data. Enya tracks are similar to Classical in the Ave, Var and Max variables, but more similar to Rock in the Freq variables. Using the LDA rule, two are predicted to be Classical (The Memory of Trees, Pax Decorum) and one Rock (Anywhere Is), and the predicted values are close to the boundary. The Abba tracks are very different from the others in that they have negative average values! This may be a CD effect; we might like to look at other Abba CDs to see whether it persists. Two tracks are unusual: Saturday Morning (Eels) and Vivaldi 6. Saturday Morning has an unusual pattern in the time/frequency plot (appendix) in that it maxes out at the high and low values; this may be due to the axis limits in the plot function.

This is a simple study. There is very little data, and the tracks were chosen rather than randomly sampled from a larger population. We might use this study to propose hypotheses to test on a larger random sample of tracks. From this small study, it looks quite viable to develop a classifier for the type of music based on variables computed from audio tracks. In a larger study we would want to test for CD effects, for different orchestral renditions of classical music, and for other types of music such as country and jazz.

7 Classifying new tracks

The 5 new tracks are classified as Rock, Classical, Classical, Classical, Classical. The first track is strongly predicted to be Rock, with predicted value 3.75. The next three tracks are close to the boundary but predicted to be Classical. The last track is strongly predicted to be Classical, with predicted value -3.20. Examining plots of these new tracks: Track 1 is clearly an Abba song, because it has a low LAve value. Track 5 looks clearly Classical, with a very low value of LFEner, but it is an outlier in the plots of the data, sometimes sitting closer to the Rock songs. The other three tracks (2, 3, 4) more consistently stay with the Classical tracks.

8 References

Swayne, D. F., Cook, D., Buja, A., Hofmann, H. and Temple Lang, D. (2005) Interactive and Dynamic Graphics for Data Analysis: With Examples Using R and GGobi, http://www.public.iastate.edu/~dicook/ggobi-book/ggobi.html.

Cutler, A. (2005) Random Forests, http://www.math.usu.edu/~adele/forests/index.htm.

Hastie, T., Tibshirani, R., and Friedman, J. (2001) The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer, New York.

R Development Core Team (2003) R: A language and environment for statistical computing, R Foundation for Statistical Computing, Vienna, Austria, ISBN 3-900051-00-3, http://www.r-project.org.

Appendix

d.music<-read.csv("music.csv",row.names=1)
d.music<-d.music[,c(1:5,36:40,71:72)]
write.table(d.music,"music-sub.csv",append=T,quote=F,sep=",",
  row.names=F,col.names=T)
write.table(d.music,"music-full.csv",append=T,quote=F,sep=",",
  row.names=F,col.names=T)
summary(d.music)
summary(t(d.music))

# Random forests
library(randomForest)
music.rf <- randomForest(as.factor(d.music[1:54,2])~.,
  data=data.frame(d.music[1:54,-c(1,2)]),
  importance=TRUE,proximity=TRUE,mtry=3)
music.rf <- randomForest(as.factor(d.music[1:54,2])~.,
  data=data.frame(d.music[1:54,c(21:35,56:70)]),
  importance=TRUE,proximity=TRUE,mtry=6)
music.rf$importance[order(music.rf$importance[,4],decreasing=T),4:5]
music.rf$importance[order(music.rf$importance[,5],decreasing=T),4:5]

# It looks like averaging the frequency variables might be a reasonable
# approach to dealing with missing values, and reducing the number of
# variables. Mostly the tracks have similar values for the frequencies
# with highest peaks.
LFreq<-apply(d.music[,21:35],1,median,na.rm=T)
RFreq<-apply(d.music[,56:70],1,median,na.rm=T)
# There is one missing value left, B2 on RFreq. We substitute
# the value of LFreq.
RFreq[48]<-LFreq[48]
d.music<-cbind(d.music,LFreq,RFreq)
d.music.sub<-d.music[,c(1:5,37:40,72:74)]
summary(d.music.sub[,1])
summary(d.music.sub[,2])
apply(d.music.sub[d.music.sub[,2]=="Rock",-c(1,2)],2,mean)
apply(d.music.sub[d.music.sub[,2]=="Rock",-c(1,2)],2,sd)
apply(d.music.sub[d.music.sub[,2]=="Classical",-c(1,2)],2,mean)
apply(d.music.sub[d.music.sub[,2]=="Classical",-c(1,2)],2,sd)
apply(d.music.sub[d.music.sub[,1]=="Abba",-c(1,2)],2,mean)
apply(d.music.sub[d.music.sub[,1]=="Beatles",-c(1,2)],2,mean)
apply(d.music.sub[d.music.sub[,1]=="Eels",-c(1,2)],2,mean)
apply(d.music.sub[d.music.sub[,1]=="Beethoven",-c(1,2)],2,mean)
apply(d.music.sub[d.music.sub[,1]=="Mozart",-c(1,2)],2,mean)
apply(d.music.sub[d.music.sub[,1]=="Vivaldi",-c(1,2)],2,mean)

apply(d.music.sub[d.music.sub[,2]=="Enya",-c(1,2)],2,mean)

par(mfrow=c(2,2))
hist(d.music.sub[d.music.sub[,2]=="Rock",3],col=2,xlim=range(d.music.sub[,3]),
  xlab=names(d.music.sub)[3],main="Rock")
hist(d.music.sub[d.music.sub[,2]=="Rock",7],col=2,xlim=range(d.music.sub[,7]),
  xlab=names(d.music.sub)[7],main="Rock")
hist(d.music.sub[d.music.sub[,2]=="Classical",3],col=2,
  xlim=range(d.music.sub[,3]),xlab=names(d.music.sub)[3],main="Classical")
hist(d.music.sub[d.music.sub[,2]=="Classical",7],col=2,
  xlim=range(d.music.sub[,7]),xlab=names(d.music.sub)[7],main="Classical")
hist(d.music.sub[d.music.sub[,2]=="Rock",4],col=2,xlim=range(d.music.sub[,4]),
  xlab=names(d.music.sub)[4],main="Rock")
hist(d.music.sub[d.music.sub[,2]=="Rock",8],col=2,xlim=range(d.music.sub[,8]),
  xlab=names(d.music.sub)[8],main="Rock")
hist(d.music.sub[d.music.sub[,2]=="Classical",4],col=2,
  xlim=range(d.music.sub[,4]),xlab=names(d.music.sub)[4],main="Classical")
hist(d.music.sub[d.music.sub[,2]=="Classical",8],col=2,
  xlim=range(d.music.sub[,8]),xlab=names(d.music.sub)[8],main="Classical")
hist(d.music.sub[d.music.sub[,2]=="Rock",5],col=2,xlim=range(d.music.sub[,5]),
  xlab=names(d.music.sub)[5],main="Rock")
hist(d.music.sub[d.music.sub[,2]=="Rock",9],col=2,xlim=range(d.music.sub[,9]),
  xlab=names(d.music.sub)[9],main="Rock")
hist(d.music.sub[d.music.sub[,2]=="Classical",5],col=2,
  xlim=range(d.music.sub[,5]),xlab=names(d.music.sub)[5],main="Classical")
hist(d.music.sub[d.music.sub[,2]=="Classical",9],col=2,
  xlim=range(d.music.sub[,9]),xlab=names(d.music.sub)[9],main="Classical")
hist(d.music.sub[d.music.sub[,2]=="Rock",6],col=2,xlim=range(d.music.sub[,6]),
  xlab=names(d.music.sub)[6],main="Rock")
hist(d.music.sub[d.music.sub[,2]=="Rock",10],col=2,xlim=range(d.music.sub[,10]),
  xlab=names(d.music.sub)[10],main="Rock")
hist(d.music.sub[d.music.sub[,2]=="Classical",6],col=2,
  xlim=range(d.music.sub[,6]),xlab=names(d.music.sub)[6],main="Classical")
hist(d.music.sub[d.music.sub[,2]=="Classical",10],col=2,
  xlim=range(d.music.sub[,10]),xlab=names(d.music.sub)[10],main="Classical")
hist(d.music.sub[d.music.sub[,2]=="Rock",11],col=2,xlim=range(d.music.sub[,11]),
  xlab=names(d.music.sub)[11],main="Rock")
hist(d.music.sub[d.music.sub[,2]=="Rock",12],col=2,xlim=range(d.music.sub[,12]),
  xlab=names(d.music.sub)[12],main="Rock")
hist(d.music.sub[d.music.sub[,2]=="Classical",11],col=2,
  xlim=range(d.music.sub[,11]),xlab=names(d.music.sub)[11],main="Classical")
hist(d.music.sub[d.music.sub[,2]=="Classical",12],col=2,
  xlim=range(d.music.sub[,12]),xlab=names(d.music.sub)[12],main="Classical")

pairs(d.music.sub[-c(54:57),c(3:6,11)],
  pch=as.numeric(d.music.sub[-c(54:57),2]))

indx1<-c(sample(c(1:10),7),sample(27:36,7),sample(37:46,7))
indx2<-c(sample(c(11:20),7),sample(21:26,4),sample(47:54,6))
indx<-c(indx1,indx2)
sort(indx)
# [1]  1  2  4  6  7  8 10 11 13 14 15 17 19 20 22 23 24 25 27 28 30 32 33 34 35
# [26] 37 38 41 42 43 44 46 47 48 49 51 53 54
c(1:54)[-indx]
# [1]  3  5  9 12 16 18 21 26 29 31 36 39 40 45 50 52
d.music.train<-d.music.sub[indx,]
d.music.test<-d.music.sub[-c(indx,55:57),]

# Trees
library(rpart)
music.rp<-rpart(d.music.train[,2]~.,data.frame(d.music.train[,c(3:6,11)]),
  method="class",parms=list(split="information"))
music.rp
table(d.music.train[,2],
  predict(music.rp,data.frame(d.music.train[,c(3:6,11)]),type="class"))
table(d.music.test[,2],
  predict(music.rp,data.frame(d.music.test[,c(3:6,11)]),type="class"))
par(mfrow=c(1,3),pty="m")
plot(music.rp)
text(music.rp)
par(pty="s")
plot(d.music.train[,5],d.music.train[,4],type="n",xlab="LMax",ylab="LAve",
  xlim=c(2900,32800),ylim=c(-98,217))
points(d.music.train[d.music.train[,2]=="Rock",5],
  d.music.train[d.music.train[,2]=="Rock",4],pch=3)
points(d.music.train[d.music.train[,2]=="Classical",5],
  d.music.train[d.music.train[,2]=="Classical",4],pch=1)
abline(v=27135.5)
lines(c(4000,27135.5),c(-5.676338,-5.676338))
title("Training data")
plot(d.music.test[,5],d.music.test[,4],type="n",xlab="LMax",ylab="LAve",
  xlim=c(2900,32800),ylim=c(-98,217))
points(d.music.test[d.music.test[,2]=="Rock",5],
  d.music.test[d.music.test[,2]=="Rock",4],pch=3)
points(d.music.test[d.music.test[,2]=="Classical",5],
  d.music.test[d.music.test[,2]=="Classical",4],pch=1)
abline(v=27135.5)
lines(c(4000,27135.5),c(-5.676338,-5.676338))
title("Test data")

# Random forests
library(randomForest)
music.rf2 <- randomForest(as.factor(d.music.train[,2])~.,
  data=data.frame(d.music.train[,c(3:6,11)]),
  importance=TRUE,proximity=TRUE,mtry=3)
music.rf2

table(d.music.train[,2],predict(music.rf2,
  newdata=data.frame(d.music.train[,c(3:6,11)])))
table(d.music.test[,2],predict(music.rf2,
  newdata=data.frame(d.music.test[,c(3:6,11)]),type="class"))
music.rf2$importance[order(music.rf2$importance[,4],decreasing=T),4:5]

# LDA
library(MASS)
cls<-factor(d.music.train[,2],levels=c("Classical","Rock"))
music.lda<-lda(d.music.train[,c(3:6,11)],cls,prior=c(0.5,0.5))
table(cls,predict(music.lda,d.music.train[,c(3:6,11)],dimen=1)$class)
cls2<-factor(d.music.test[,2],levels=c("Classical","Rock"))
table(cls2,predict(music.lda,d.music.test[,c(3:6,11)],dimen=1)$class)
music.lda.xtr<-predict(music.lda,d.music.train[,c(3:6,11)],dimen=1)$x
music.lda.xts<-predict(music.lda,d.music.test[,c(3:6,11)],dimen=1)$x
par(mfrow=c(2,2))
hist(music.lda.xtr[d.music.train[,2]=="Classical"],breaks=seq(-6,6,by=0.5),
  col=2,xlim=c(-6,6),xlab="LDA Predicted",main="Classical Training")
abline(v=0)
hist(music.lda.xts[d.music.test[,2]=="Classical"],breaks=seq(-6,6,by=0.5),
  col=2,xlim=c(-6,6),xlab="LDA Predicted",main="Classical Test")
abline(v=0)
hist(music.lda.xtr[d.music.train[,2]=="Rock"],breaks=seq(-6,6,by=0.5),
  col=2,xlim=c(-6,6),xlab="LDA Predicted",main="Rock Training")
abline(v=0)
hist(music.lda.xts[d.music.test[,2]=="Rock"],breaks=seq(-6,6,by=0.5),
  col=2,xlim=c(-6,6),xlab="LDA Predicted",main="Rock Test")
abline(v=0)
for (i in c(3:6,11)) cat(cor(d.music.train[,i],music.lda.xtr),"\n")
mn<-(music.lda$means[1,]+music.lda$means[2,])/2
sum(mn*music.lda$scaling)
prd<-as.matrix(d.music.train[,c(3:6,11)])%*%music.lda$scaling-15.9
prd[order(prd)]

# Predict Enya
predict(music.lda,d.music.sub[55:57,c(3:6,11)],dimen=1)$class
predict(music.lda,d.music.sub[55:57,c(3:6,11)],dimen=1)$x

# Predict new observations
d.music.new<-read.csv("music-new.csv")
d.music.new.vars<-cbind(d.music.new[,c(1:3,35)],
  apply(d.music.new[,19:33],1,median,na.rm=T))
dimnames(d.music.new.vars)[[2]][5]<-"LFreq"
predict(music.lda,d.music.new.vars,dimen=1)$class
predict(music.lda,d.music.new.vars,dimen=1)$x

x<-cbind(rep(NA,5),rep(NA,5),d.music.new.vars)
dimnames(x)[[2]][1]<-"Artist"
dimnames(x)[[2]][2]<-"Type"
d.music.plus<-rbind(d.music.sub[,c(1:6,11)],x)
x<-as.numeric(d.music.plus[,2])
x[is.na(x)]<-4
pairs(d.music.plus[,3:7],pch=x)
write.table(d.music.plus,"music-plusnew-sub.csv",append=T,quote=F,
  col.names=T,row.names=T,sep=",")

Plots of the audio tracks would go here; they are available on the course web site.