NETFLIX MOVIE RATING ANALYSIS


Danny Dean

EXECUTIVE SUMMARY

Perhaps only a few of us have wondered whether the number of words in a movie's title could be linked to its success. You may question the relevance of this association and argue that a viewer does not have the right to rate a movie on anything other than its content. This is true, but could we find an underlying trend between the number of words in a movie's title and the average 5-star rating given by its viewers? Do movies with fewer words in their title end up being better movies than others?

Netflix, an online DVD rental service, released a data set consisting of over 17,000 movies and their ratings given by customers between 1998 and 2005. Analyzing this data set, I found that movies with 1, 2, or 3 words in their title account for more than 50% of movies. More importantly, those same movies rank in the bottom 33% by average rating when compared with movies whose titles have other word counts.

PROBLEM DESCRIPTION

The goal of my research is to find whether or not a correlation exists between the number of words in a movie's title and the average Netflix user rating for that movie.

DATA SET DESCRIPTION

The Netflix Prize Data Set was initially released to serve as a training data set for the Netflix Prize (see http://www.netflixprize.com). It has since been released, in conjunction with the test data set, to the UCI Machine Learning Repository for data mining. The data set consists of over 17,000 movies, 480,000 customers, and 100 million ratings. Attributes available are defined as follows:

Movie
Each movie is represented by a unique ID and contains information including year of release and title.

  ID               Unique integer among all movies
  Year of Release  Year the movie was officially released
  Title            Official title of the movie

Customer
Netflix customers are represented by a unique customer ID.

  ID               Unique integer among all customers

Rating
Each rating has a movie ID, customer ID, date of rating, and value of rating (1-5 stars).

  Movie ID         ID of an existing movie
  Customer ID      ID of an existing customer
  Date Rated       The date the customer placed this rating
  Value            The value of the rating (1-5 stars, 1 = lowest)

The ratings were collected between October 1998 and December 2005 and reflect the distribution of all ratings received during this period. (UCI)
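The attributes above map naturally onto two small record types. The following is a minimal Python sketch, not the author's C# code. The class names, field names, and the comma-separated "id,year,title" line layout are assumptions made for illustration; the report only says that one text file lists all movie IDs, release years, and titles. Customers carry nothing beyond an ID, so they are represented by a plain integer.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Movie:
    movie_id: int                    # unique integer among all movies
    year_of_release: Optional[int]   # None when the year is missing
    title: str                       # official title of the movie


@dataclass(frozen=True)
class Rating:
    movie_id: int       # ID of an existing movie
    customer_id: int    # ID of an existing customer
    date_rated: date    # date the customer placed this rating
    value: int          # 1-5 stars, 1 = lowest

    def __post_init__(self) -> None:
        if not 1 <= self.value <= 5:
            raise ValueError(f"rating value must be 1-5, got {self.value}")


def parse_movie_line(line: str) -> Movie:
    """Parse one line of a hypothetical 'id,year,title' movie listing.

    The line is split at most twice so that commas inside the title
    are preserved.
    """
    movie_id, year, title = line.rstrip("\n").split(",", 2)
    return Movie(int(movie_id), int(year) if year.isdigit() else None, title)
```

Splitting at most twice matters for this analysis: a title that is truncated at its first comma would be assigned the wrong word count.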

ANALYSIS TECHNIQUE

In order to analyze this data set, I needed to create a custom application written in the C# 3.0 programming language. The application performed its analysis as follows:

1. Selected 1,537 random movies
   a. A common statistical formula for determining sample size directed me to use 1,537 movies for my analysis. (McClave, 2009)
      i. Equation: $n = \frac{(z_{\alpha/2})^2 \sigma^2}{(SE)^2}$
      ii. The value $z_{\alpha/2}$ is defined as the value of the standard normal random variable Z such that an area of $\alpha/2$ under the standard normal curve lies to its right.
      iii. $\sigma$ is defined as the standard deviation and can be approximated using $\sigma = R/4$, where R is the range of observations, which in our case is the range of rating values, 5 - 1 = 4.
      iv. SE is defined as the desired acceptable sampling error; I chose .05.
      v. $n = \frac{(1.96)^2 (1)^2}{(.05)^2}$ = 1,536.64, which rounds up to 1,537 movies.
   b. C# uses pseudo-random numbers that are chosen with equal probability from a finite set of numbers. The chosen numbers are not completely random because a definite mathematical algorithm is used to select them, but they are sufficiently random for practical purposes. (Microsoft Corporation)
2. Calculated the number of ratings for each movie
3. Calculated the average rating for each movie
   a. Summation of all rating values divided by the number of ratings
4. Calculated the standard deviation of ratings for each movie
   a. Equation: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
5. Grouped movies by the word count of the title
   a. The application calculates title word count by splitting the entire title into sections using a single whitespace as a divider. The number of sections created is the number of words in the title.
6. Calculated the number of movies in each group
7. Calculated the average rating of movies for each group
   a. Summation of all average ratings divided by the number of movies
8. Calculated the average standard deviation of movie ratings for each group
   a. Summation of all standard deviations divided by the number of movies
9. Displayed a table showing each group and its title word count, number of movies, minimum average rating, average (average) rating, maximum average rating, minimum standard deviation, average standard deviation, and maximum standard deviation

ASSUMPTIONS

- The data set maintains integrity
- Users vote truthfully
- All ratings between 1998 and 2005 are actually represented
- Words are never separated by more than a single whitespace
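The analysis steps can be sketched compactly in Python. This is an illustrative reconstruction, not the author's C# application, which the report does not show. The function names and the input shape (a dictionary from title to list of rating values) are hypothetical. The sketch uses the sample standard deviation (dividing by n - 1) via `statistics.stdev`, assuming that matches the original.

```python
import math
import statistics
from collections import defaultdict


def sample_size(z: float, sigma: float, se: float) -> int:
    """Step 1: n = z^2 * sigma^2 / SE^2, rounded up to a whole movie."""
    return math.ceil((z ** 2) * (sigma ** 2) / (se ** 2))


def word_count(title: str) -> int:
    """Step 5: split on a single space; each section counts as one word."""
    return len(title.split(" "))


def summarize(ratings_by_title):
    """Steps 2-8: per-movie statistics, grouped by title word count."""
    groups = defaultdict(list)
    for title, ratings in ratings_by_title.items():
        mean = statistics.mean(ratings)
        # stdev needs at least two ratings; fall back to 0.0 otherwise.
        sd = statistics.stdev(ratings) if len(ratings) > 1 else 0.0
        groups[word_count(title)].append((mean, sd))

    table = {}
    for count, stats in groups.items():
        means = [m for m, _ in stats]
        sds = [s for _, s in stats]
        table[count] = {
            "movies": len(stats),
            "min_avg": min(means),
            "avg_avg": statistics.mean(means),
            "max_avg": max(means),
            "min_sd": min(sds),
            "avg_sd": statistics.mean(sds),
            "max_sd": max(sds),
        }
    return table


if __name__ == "__main__":
    print(sample_size(z=1.96, sigma=1.0, se=0.05))
    print(summarize({"Alpha": [1, 5], "Beta Gamma": [3, 3, 3], "Delta": [4, 4]}))
```

Splitting on a single space shows why the last assumption matters: `word_count("The  Matrix")`, with two spaces, yields 3 because the empty section between the spaces is counted as a word. Note also that `avg_avg` is an unweighted mean of per-movie averages, as in steps 7 and 8, so a movie with ten ratings counts as much as one with ten thousand.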

RESULTS

My custom application produced the following tables after roughly 5 minutes of heavy calculation. Please note that tables are sorted on the average rating column, high to low. Min Avg, Avg Avg, and Max Avg are the minimum, average, and maximum of the per-movie average ratings in each group; Min s, Avg s, and Max s are the same for the per-movie standard deviations.

1ST ITERATION

Rank  Word Count  Movies  Min Avg  Avg Avg  Max Avg  Min s    Avg s    Max s
1     13          3       3.0531   3.6326   3.93752  0.98831  1.0732   1.20149
2     7           69      2.18548  3.47063  4.2      0.89694  1.15593  1.58683
3     9           23      2.20238  3.46036  4.46482  0.91425  1.16488  1.46528
4     10          9       2.78     3.43411  3.95552  0.98249  1.09407  1.16816
5     6           108     1.4      3.41791  4.25     0.76168  1.14549  1.54909
6     4           221     1.6699   3.32801  4.4839   0.84119  1.10333  1.43974
7     5           195     1.74194  3.31637  4.42814  0.85603  1.13473  1.54905
8     12          8       2.12821  3.2826   4.09787  1.00122  1.16659  1.34932
9     8           41      1.53801  3.28169  4.46073  0.79859  1.18768  1.42835
10    3           302     1.49624  3.22392  4.31399  0.74492  1.09592  1.54804
11    2           354     1.72414  3.09514  4.1426   0.83197  1.06617  1.47119
12    1           201     1.65546  3.08871  4.14223  0.83805  1.08277  1.40304
13    11          3       2.41975  2.71712  2.97059  1.09389  1.14149  1.16629

QUICK ANALYSIS

Movies with 1, 2, or 3 words in their title account for more than 50% of the 1,537 movies sampled. They also score in the bottom 33% with regard to average rating.

2ND ITERATION

Rank  Word Count  Movies  Min Avg  Avg Avg  Max Avg  Min s    Avg s    Max s
1     13          3       3.43694  3.61706  3.90718  1.02981  1.17496  1.35625
2     18          1       3.60377  3.60377  3.60377  1.09789  1.09789  1.09789
3     7           58      1.92754  3.40083  4.25401  0.885    1.17033  1.47626
4     11          6       2.97403  3.38544  4.2735   0.97656  1.17269  1.41325
5     6           111     1.90244  3.38472  4.38008  0.82449  1.15591  1.54909
6     9           28      1.9469   3.34991  4.14086  0.99594  1.126    1.34626
7     4           231     1.95122  3.29863  4.59551  0.77926  1.12546  1.4762
8     5           173     1.81034  3.29477  4.51601  0.71354  1.12616  1.45519
9     8           53      1.53801  3.27189  4.00284  0.88949  1.18858  1.55593
10    3           316     1.6622   3.23906  4.67099  0.68251  1.08072  1.45177
11    10          11      2.22196  3.16551  3.5998   1.03038  1.09758  1.23482
12    15          1       3.16412  3.16412  3.16412  1.18733  1.18733  1.18733
13    2           351     1.6      3.12298  4.25188  0.80883  1.06803  1.42905
14    1           188     1.76829  3.11148  4.44671  0.8135   1.07134  1.37667

QUICK ANALYSIS

Again, movies with 1, 2, or 3 words in their title account for more than 50% of the 1,537 movies sampled. They score in the bottom 36% with regard to average rating.

3RD ITERATION

Rank  Word Count  Movies  Min Avg  Avg Avg  Max Avg  Min s    Avg s    Max s
1     13          3       2.75862  3.50678  3.93752  0.98831  1.07593  1.22063
2     12          4       3.25     3.46214  3.60117  1.05637  1.19329  1.36744
3     7           61      2.0102   3.45853  4.28295  0.885    1.15672  1.52658
4     8           40      2.36842  3.41979  4.26608  0.93875  1.14182  1.54406
5     6           112     2.16296  3.38283  4.33596  0.91927  1.15293  1.48393
6     14          2       3.01887  3.38055  3.74222  1.03262  1.15277  1.27292
7     5           189     1.76596  3.36516  4.6      0.71354  1.13672  1.58659
8     4           244     1.49776  3.29792  4.44833  0.82299  1.11032  1.42609
9     9           19      2.20238  3.29497  3.82423  1.00256  1.1556   1.31072
10    10          16      2.33884  3.2944   3.74044  0.98626  1.15836  1.41747
11    3           296     1.90576  3.25164  4.40801  0.85114  1.09642  1.3996
12    11          6       2.41975  3.21774  3.92373  1.0251   1.1662   1.38752
13    15          1       3.16412  3.16412  3.16412  1.18733  1.18733  1.18733
14    1           193     1.76829  3.15569  4.52261  0.8135   1.0801   1.34483
15    2           351     1.72115  3.08266  4.40408  0.81677  1.07193  1.5013

QUICK ANALYSIS

For a third consecutive time, movies with 1, 2, or 3 words in their title account for more than 50% of the 1,537 movies sampled. They again score in the bottom 33% with regard to average rating.

SUMMARY

Interestingly enough, the group of movies with 13 words in their title topped the charts in each iteration. Movies with 1, 2, or 3 words in their title consistently showed up at the bottom of the table with regard to average rating. Movies with 4 or 5 words in their title maintained a position in the middle of the list (50th percentile) in each iteration.
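The "more than 50%" claim in each quick analysis can be checked directly from the movie counts in the three tables. The short Python sketch below does only that arithmetic. The counts are copied from the tables; the variable and function names are hypothetical.

```python
SAMPLE_SIZE = 1537

# Movie counts for titles with 1, 2, and 3 words, copied from the result tables.
short_title_counts = {
    "1st iteration": {1: 201, 2: 354, 3: 302},
    "2nd iteration": {1: 188, 2: 351, 3: 316},
    "3rd iteration": {1: 193, 2: 351, 3: 296},
}


def short_title_share(counts, sample_size=SAMPLE_SIZE):
    """Fraction of the sample whose titles have 1, 2, or 3 words."""
    return sum(counts.values()) / sample_size


shares = {name: short_title_share(c) for name, c in short_title_counts.items()}

if __name__ == "__main__":
    for name, share in shares.items():
        print(f"{name}: {share:.1%} of sampled movies have 1-3 word titles")
```

In all three iterations the share falls between 54% and 56%, which is consistent with the report's statement.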

ISSUES

- Size of data set
  o The 4GB data set made it impossible to analyze the entire data set using my computer.
- Organization of data
  o The data set consisted of over 17,000 text files. It had one text file containing all movie IDs, years of release, and titles. In addition, each movie had its own text file consisting of all ratings for that movie.
- Lack of existing tool
  o Due to the organization of the data set, I decided it would be more practical to develop a custom-tailored application rather than trying to use an existing analytical software package.

REFERENCES

McClave, S. (2009). A First Course in Statistics. Pearson Education, Inc.

Microsoft Corporation. (n.d.). MSDN. Retrieved December 05, 2009, from Random Class: http://msdn.microsoft.com/en-us/library/system.random(vs.71).aspx

UCI. (n.d.). UCI Machine Learning Repository. Retrieved December 12, 2009, from Netflix Prize Data Set: http://archive.ics.uci.edu/ml/datasets/netflix+prize