A combination of approaches to solve the task "How Many Ratings?" of the KDD CUP 2007


A combination of approaches to solve the task "How Many Ratings?" of the KDD CUP 2007

Jorge Sueiras
C/ Arequipa, +34 9 382 45 54
jorge.sueiras@neo-metrics.com

Daniel Vélez
C/ Arequipa, +34 9 382 45 54
daniel.velez@neometrics.com

José Luis Flórez
C/ Arequipa, +34 9 382 45 54
jose.luis.florez@neometrics.com

ABSTRACT
This paper presents a solution to the KDD CUP 2007 task "How Many Ratings?". A combination of three different approaches is used to produce a final solution that improves on the results obtained by each of these procedures on its own.

Categories and Subject Descriptors
I.5.1 [Pattern Recognition]: Models - statistical.

Keywords
Predictive modeling, forecasting, data mining.

1. INTRODUCTION
The KDD CUP 2007 task 2 is based on a competition proposed by Netflix (http://www.netflixprize.com). For the Netflix competition, a training data set of more than 100 million ratings associated with user-movie pairs is provided [1]. These data were collected between October 1998 and December 2005. The aim of the contest is to estimate around two million ratings with an average prediction error lower than a prefixed value.

The purpose of this second task of the KDD competition was to forecast the number of ratings to be received during 2006 by 8,863 movies randomly chosen from the Netflix data set. An important constraint for this task was that only ratings given by users already present in the Netflix data file could be taken into account; that is, ratings by users registered in 2006 were not considered.

To accomplish this goal, three methodologies were developed:

- Memory-based reasoning techniques: the expected value for a movie in 2006 is computed as a weighted sum of the 2005 values associated with movies showing similar behavior.
- ARMA models: fitted to the time series defined by the monthly ratings of each movie. Each series was transformed to avoid being biased by new-user effects.
- A basic procedure (the "falling curves" method) that estimates rating percentages based on those registered in the previous year, grouping together all movies whose ratings started in the same month.

None of these methods provided an optimal result on its own, but their estimates, together with other factors, provided an interesting vector of exogenous variables for the construction of a final model.

All the methodologies introduced above are described next. Special attention is paid to how the pronounced drop in the number of new users at the end of 2005 was taken into account.

2. FIRST STEP: ANALYSIS OF NETFLIX RATING DATA
An important property of the Netflix data file is the drastic reduction, at the end of 2005, in the number of users who began to review and of movies that began to be reviewed. Figures 1 and 2 show the number of new users and of new movies by starting month. In the first case (number of users), a gradual decline over the last two months can be observed, while in the second (number of movies), the value is almost zero in the last two months.
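Figures 1 and 2 summarize counts of this kind. As a rough, hypothetical sketch (not the authors' code; the file name and column layout are assumptions, since the raw Netflix data ships in a different per-movie format), the per-month counts of new users and new movies can be obtained by grouping on each user's and each movie's first rating date:

```python
import pandas as pd

# Hypothetical flat layout of the Netflix training ratings: one row per rating,
# with columns user_id, movie_id, date, stars.
ratings = pd.read_csv("ratings.csv", parse_dates=["date"])

# Starting month of each user / movie = month of its first rating.
user_start = ratings.groupby("user_id")["date"].min().dt.to_period("M")
movie_start = ratings.groupby("movie_id")["date"].min().dt.to_period("M")

# Number of new users and new movies per starting month (cf. Figures 1 and 2).
new_users_by_month = user_start.value_counts().sort_index()
new_movies_by_month = movie_start.value_counts().sort_index()
```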
Figure 1. Number of new users by month.

Figure 2. Number of new movies by month.

The effect of this drop is very important: ratings in 2006 are expected to decrease not only because new 2006 users are not counted, but also because the number of users who appeared at the end of 2005 is considerably lower than expected. To address this problem, the number of users who should have started rating was estimated by comparing, month by month, the new-user percentages in 2005 with the average values of the 2002-2004 period. Using the difference between the estimated and the real data (stored in the Netflix training data set), the historical data were corrected by eliminating an identical percentage of new users in the final months of the previous years. Table 1 shows, for each starting month, the percentage of users that was eliminated in order to emulate the behavior observed in the last months of 2005. This procedure was adapted for movies and applied to the KDD CUP 2007 task 1; a full description of it can be found in the paper associated with task 1.

Table 1. Percentage of users left out, by starting month.

    Starting month    Users left out
    December          91.3%
    November          68%

3. SOLUTION A: K-NEIGHBORS METHOD
The K-neighbors method represents each movie as a vector. The values in these vectors are the numbers of monthly ratings, excluding reviews given by users who registered in the same year as the month in question. These vectors were split into two parts:

- The first part is composed of 12 values, representing the evolution of the number of ratings given by users to the movie during 2004.
- The second part, the image of the previous vector, is the sum of the ratings obtained by the movie in 2005.

Based on these vectors, the expected value of a movie in 2006, given the sequence of its monthly ratings in 2005, was computed as a weighted mean of the images corresponding to the vectors whose 2004 trajectories are closest to that sequence. The weights were defined in relation to the distances between the sequence and the trajectories of its neighbors.

3.1 Algorithm used
Let $T_{Movie,Year}$ be the vector defined by the monthly ratings received by a movie throughout a given year:

$$T_{Movie,Year} = \left(T_{Movie,Year}(1), \ldots, T_{Movie,Year}(12)\right)$$

Suppose we wish to compute the number of ratings Movie will obtain throughout 2006.

3.1.1 Step 1: Search for neighbors
Starting from the monthly ratings obtained by Movie in 2005, $T_{Movie,2005}$, the closest trajectories to Movie are found by computing Euclidean distances with respect to the 2004 monthly ratings of the candidate movies:

$$d_k = \left\lVert T_{Movie,2005} - T_{Movie\_k,2004} \right\rVert_2, \qquad k \in \{1, 2, \ldots, K\}$$

3.1.2 Step 2: Prediction computation
The expected value for Movie in 2006, $P_{Movie,2006}$, is the weighted mean of the 2005 totals (images) of the neighboring trajectories selected in Step 1, with weights given as a function of the distance vector:

$$P_{Movie,2006} = \frac{\sum_{k=1}^{K} \frac{1}{d_k^{2}}\, S_{Movie\_k,2005}}{\sum_{k=1}^{K} \frac{1}{d_k^{2}}}, \qquad \text{where } S_{Movie\_k,2005} = \sum_{i=1}^{12} T_{Movie\_k,2005}(i)$$

To conclude the description of this methodology, two cautionary measures should be mentioned:

- The search for trajectories neighboring a given one for a specific year is performed by looking for similar trajectories from the previous year. Since trajectory variability increases with time (a heteroscedasticity effect), a prior logarithmic transformation was applied to the trajectories in order to make them comparable. In addition, applying this transformation is consistent with the error measure used for model quality assessment.
- Both the highest and the lowest image were discarded when computing the weighted mean, in order to avoid undue influence from movies with extreme images.

Finally, the forecast error achieved by this method over the scoring data was 0.5828.
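As a concrete illustration of the prediction step, the following sketch (not the authors' code) computes the weighted mean of neighbor images from pre-built arrays of log-transformed monthly counts. The value of K, the candidate set, and the exact inverse-squared-distance weighting are assumptions on our part; the trimming of the highest and lowest image follows the cautionary measure described above.

```python
import numpy as np

def knn_forecast(target_2005, cand_2004, cand_2005_totals, k=20):
    """Weighted-mean forecast of a movie's 2006 ratings (Section 3 sketch).

    target_2005      : (12,) log-transformed monthly ratings of the target movie in 2005
    cand_2004        : (n, 12) log-transformed monthly ratings of candidate movies in 2004
    cand_2005_totals : (n,) total 2005 ratings of the candidates (their "images")
    """
    # Step 1: Euclidean distance between the target's 2005 trajectory
    # and each candidate's 2004 trajectory.
    d = np.linalg.norm(cand_2004 - target_2005, axis=1)

    # Keep the k closest candidates (k >= 3 so the trimming below is possible).
    idx = np.argsort(d)[:k]
    images = cand_2005_totals[idx]

    # Discard the highest and the lowest image to limit the influence of extremes.
    order = np.argsort(images)
    keep = idx[order[1:-1]]

    # Step 2: weight each remaining neighbor inversely by its squared distance.
    w = 1.0 / (d[keep] ** 2 + 1e-9)
    return float(np.sum(w * cand_2005_totals[keep]) / np.sum(w))
```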

4. SOLUTION B: ARMA METHOD
This approach fits an ARMA model [2] to the monthly rating time series of each movie. Once again, only ratings given by users registered before the beginning of the year in question were taken into account. All the series were smoothed, eliminating the trend effect caused by the annual increase in users and correcting monthly level shifts. The user-registration effect was suppressed by computing annual factors as the ratio between the users existing in the considered year and the users existing at the end of 2005. Monthly level shifts were corrected with monthly factors that raise the number of ratings in months where this quantity had been lower and reduce it otherwise. In a last stage, ARMA models were fitted to these smoothed series. The procedure was only applied to movies over one year old.

The model finally chosen for the fit was an ARMA(1,1):

$$(1 - \phi B)\,\log\!\left(X_{Movie,t}\right) = (1 + \theta B)\,\varepsilon_{Movie,t}$$

where $X_{Movie,t}$ is the time series of the number of ratings obtained by the movie within a time unit (a month), smoothed as described above, and $B$ is the lag operator.

In this case, the forecast error achieved over the scoring data was 0.9485. This method's poor performance is probably due to the generally short data histories available for series construction.
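Purely as an illustration of this step (the paper does not specify an implementation), an ARMA(1,1) fit on one smoothed series could be written with statsmodels as follows; the +1 offset inside the logarithm and the back-transformation to a 2006 total are our own assumptions.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

def arma_forecast_2006(monthly_counts_smoothed):
    """Fit ARMA(1,1) to log monthly ratings and forecast the next 12 months.

    monthly_counts_smoothed : 1-D array of monthly rating counts, already corrected
    for the annual user-growth factor and the monthly level shifts.
    """
    y = np.log(np.asarray(monthly_counts_smoothed, dtype=float) + 1.0)
    model = ARIMA(y, order=(1, 0, 1))          # ARMA(1,1) == ARIMA(1,0,1)
    fitted = model.fit()
    log_forecast = fitted.forecast(steps=12)   # 12 monthly predictions for 2006
    return float(np.sum(np.exp(log_forecast) - 1.0))
```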
5. SOLUTION C: FALLING CURVES METHOD
The third method is quite simple and does not use any mathematical model. The monthly behavior of the ratings of different movies was analyzed, excluding ratings given by new users registered in the same year as the month under analysis. For instance, Figure 3 shows the average monthly percentage of ratings during 2005 for movies that began to be rated prior to January 2005, considering only users registered prior to January 2005. The noticeable decreasing trend at the end of Figure 3 gives this method its name.

Figure 3. Falling curves applied to each one of the five groups of movies.

This observation led us to the following hypothesis: the way in which the ratings of a movie decrease in the following year is similar for movies with a similar first-review date. Five groups of movies were therefore distinguished according to their age (the age of a movie is defined as the number of months since its first review):

- movies under six months old;
- movies seven or eight months old;
- movies nine or ten months old;
- movies eleven or twelve months old;
- movies over one year old.

For each of these groups, the respective percentages of ratings for the following twelve months were computed according to the behavior observed in 2005. These percentages were used to estimate the number of reviews expected in 2006, as sketched below. The forecast error obtained with this basic procedure over the scoring data was 0.900.
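A rough sketch of how such group profiles could be applied follows. The paper leaves the exact normalization unspecified, so the reference level (the movie's 2005 ratings from existing users) and the meaning of the twelve percentages are assumptions: each movie is assigned to one of the five age bands above, and the band's monthly percentages, estimated from 2005 behavior, are applied to the movie's reference level.

```python
import numpy as np

# Age bands (months since first review) matching the five groups listed above.
AGE_BANDS = [(0, 6), (7, 8), (9, 10), (11, 12), (13, None)]

def age_group(age_months):
    """Return the index of the age band a movie belongs to."""
    for g, (lo, hi) in enumerate(AGE_BANDS):
        if age_months >= lo and (hi is None or age_months <= hi):
            return g
    return len(AGE_BANDS) - 1

def falling_curve_forecast(ratings_2005_total, age_months, group_profiles):
    """Estimate 2006 ratings from the group's observed decay profile.

    ratings_2005_total : ratings the movie received in 2005 (existing users only)
    group_profiles     : dict {group index: 12 monthly percentages of the previous
                          year's ratings, estimated from 2005 behavior}
    """
    profile = group_profiles[age_group(age_months)]
    return ratings_2005_total * float(np.sum(profile))
```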

6. FINAL MODEL
The definitive model was built as follows:

1. The Netflix data set was split into two subsets. The first contained the ratings made prior to January 2005, while the second contained the remaining ones.
2. In the first subset, ratings given by users who began their reviews in the latest months were excluded, just as explained above. In addition to the predictions given by each of the solutions described, this subset was used to generate different input variables.
3. The second subset was used to count the ratings given in 2005 by the users existing in the first subset. The resulting values (x) were converted by the logarithmic transformation log(x+1).
4. A data set with input and target variables for approximately 14,000 movies that began to be reviewed prior to January 2005 was obtained by joining these subsets. This data set was then partitioned into two tables of equal size, used as training and testing data for the final model.

The following variables were included in the model:

- log(x+1), where x is the rating forecast provided by the K-neighbors method for 2005;
- log(x+1), where x is the rating forecast provided by the ARMA method for 2005;
- log(x+1), where x is the rating forecast provided by the falling curves method for 2005;
- log(x+1), where x is the total number of reviews given to the movie since it was registered;
- number of months since the first review;
- percentage of low scores (stars = 1) given by users;
- percentage of high scores (stars = 5) given by users;
- average rating of the movie;
- standard deviation of the ratings of the movie;
- number of months since the last review;
- percentage of reviews given in the last year;
- percentage of reviews given in the last three months with respect to the reviews given in the last year;
- ratio between the reviews of the last year and those of the previous one;
- percentage of reviews given by users who have been scoring for more than a year.

The final model was a neural network with perceptron architecture [4] and a single hidden layer of 5 nodes. The forecast error achieved over the testing data with this model was 0.49. Given that the final error over the scoring data was 0.5227, the model was considered to be overfitted. The reason for this could be the reduced number of movies (around 7,000) used for training.

Finally, two residual plots were produced: residuals against forecasts over the training set and over the scoring set (Figures 4 and 5). Figure 4 shows that the residuals are properly adjusted in the first case. However, Figure 5 shows that the movies with the highest numbers of ratings are overestimated over the scoring data.

Figure 4. Residual vs. predicted over training data.

Figure 5. Residual vs. predicted over scoring data.
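To make this final stage concrete, here is a minimal sketch under the assumptions above: the feature table (hypothetical file and column names) holds the three log-transformed method forecasts for 2005 plus the descriptive variables, the data are split into two equal halves, and scikit-learn's MLPRegressor with one 5-node hidden layer stands in for whatever perceptron implementation was actually used.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# One row per movie reviewed before January 2005; the three method forecasts and the
# review counts are assumed to be log(x+1)-transformed already (hypothetical file).
features = pd.read_csv("final_model_inputs.csv")
target = np.log1p(features.pop("ratings_2005_existing_users"))

# Two tables of equal size, used as training and testing data.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.5, random_state=0)

# Single hidden layer of 5 nodes, as described in the text.
mlp = MLPRegressor(hidden_layer_sizes=(5,), max_iter=2000, random_state=0)
mlp.fit(X_train, y_train)

rmse = np.sqrt(np.mean((mlp.predict(X_test) - y_test) ** 2))  # error on the log scale
```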

7. CONCLUSIONS
In this paper we have presented our solution to task 2 of the KDD CUP 2007. In our opinion, there are two basic issues that must be considered in order to achieve a good solution to the problem. The first is that reviews given by new users are not counted, even though these users are the ones who give the largest number of ratings; because of this, monthly ratings show an obvious decreasing trend that must be taken into account. The second is that the input data set must reflect the noticeable fall in new users at the end of 2005. Overlooking these effects would greatly increase the average error achieved by the approaches reviewed.

Different mixed models were built by combining predictions resulting from some of the three outlined methodologies. We observed that the absence of any of the predictions would have generated a result considerably worse than the solution finally submitted.

8. ACKNOWLEDGMENTS
We are very grateful for the support we have received since the beginning of this project. In particular, we are indebted to Ana Alvarez, Natalia Molina, Maria Sala and Maria Sanchez for their efforts in solving this task, and to Juan-Carlos Ibañez and Fausto Morales for their help in writing this paper. We would also like to express our thanks to the KDD CUP organizers for the work they have carried out.

9. REFERENCES
[1] J. Bennett and S. Lanning. The Netflix Prize. KDD Cup and Workshop 2007, San Jose, California, Aug 12, 2007.
[2] G. E. P. Box and G. M. Jenkins, editors. Time Series Analysis: Forecasting and Control. Holden-Day, San Francisco, 1970.
[3] E. Castillo, A. J. Conejo, P. Pedregal, R. Garcia, and N. Alguacil. Building and Solving Mathematical Programming Models in Engineering and Science. Pure and Applied Mathematics: A Wiley-Interscience Series of Texts, Monographs and Tracts, 2001.
[4] R. O. Duda, P. Hart, and D. G. Stork. Pattern Classification. Wiley, New York, 2001.
[5] T. Hastie, R. Tibshirani, and J. H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.