Group A3. Anurag Sharma Shashvat Rai Siddhartha Chatterji Siddharth Raman Singh Nitesh Batra Sandip Chaudhuri. BookCrossing. Data Mining Group Project


Executive summary

Our analysis aims to develop a recommendation system for BookCrossing.com, an online book-sharing community. Using user-level book ratings and basic location and demographic information, we propose a two-pronged recommendation system.

Background and Problem description:

BookCrossing is defined as "the practice of leaving a book in a public place to be picked up and read by others, who then do likewise." The term is derived from bookcrossing.com, a free online book club which began in order to encourage the practice, aiming to "make the whole world a library." The exchange of books may take any of a number of forms, including releasing books in a public place, direct swaps with other members of the website, or "book rings" in which books travel in a set order to participants who want to read a certain book. There are currently 901,229 BookCrossers and 6,725,334 books travelling through 132 countries.

Essentially, each BookCrossing member:

- Registers on the website, providing location, name and age.
- Picks up a book, reads it and assigns it a rating on the website.
- Releases the book to the next user.

BookCrossing.com has a great deal of data on book ratings across genres and geographies, along with basic demographic and geographic information (location and age). At present it has no recommendation system in place, other than a naïve rule based on the number of users reviewing a particular book and its average rating. Our aim was to use BookCrossing data to suggest a recommendation system for new and existing members of the site.

Data: Sources and Challenges

Our data was obtained from a crawl of the BookCrossing website over a four-week period by Cai-Nicolas Ziegler of the University of Freiburg. The raw data consisted of user IDs, ISBNs, book ratings, location and age for 278,000 users across 10 countries. We confined ourselves to US data, however, since 30% of BookCrossing members are based in the United States. We also used Amazon.com metadata to obtain genre / category information on the books rated.

Tools / programming languages used: SAS, Spotfire, XLMiner, Perl script

The process followed in preparing our data is illustrated in Figure 1 below:

- Step 1: Keep US Users and Eliminate Books with Missing Ratings
- Step 2: Match by ISBN to Amazon Metadata for Book Category (Genre)
- Step 3: Transpose Data to be Unique at the User Id Level
- Step 4: Categorize Regions within the US into 10 Federal Regions
- Step 5: Create Like / Dislike Dummy Variable
- Final Data

Figure 1: Data Preparation Process Overview

Step 1: The raw data was comma delimited; however, some internal fields also had commas within them (for example, book titles). This step of the cleaning process was achieved by running search-and-replace commands using regular expressions.

Step 2: This involved collating data for user information (user ID, state and age), books information (book ISBN and name) and ratings data (book ISBN, user ID and book rating) into a common data file. This involved the following set of operations:

- The data from Amazon was in XML format, and a parser was written in Perl to pull out the primary category and Amazon sales rank for each book (refer to Appendix 1).
- A left inner join operation was then performed on the resultant tables, using book ISBN as the primary key to collate the ratings data and books information; user ID was used as the primary key for the final merge operation with user data.

Step 3: Transposition of the data was critical for our analysis and involved significant effort and time. Because Excel cannot handle transposition operations for a file with 500k rows, we used SAS to generate genre-based rating columns (categorical) at the individual user level for every user rating.

Step 4: Dummies were generated for user locations after classifying US states into 10 federal regions.

Step 5: Each user rates books on a scale of 0-10, with higher scores being better. We assumed that a person rating a book 5 or higher likes the book.
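Steps 2, 3 and 5 above can be sketched in Python as follows. This is a minimal illustration, not the SAS and Perl pipeline actually used. The sample rows, the column names (`user_id`, `isbn`, `rating`, `category`) and the helper names are hypothetical, and two choices are assumptions of this sketch: fields with internal commas are taken to be quoted so that the standard `csv` module can parse them (the report used regular-expression search and replace instead), and a user is counted as liking a category if they liked at least one book in it.

```python
import csv
import io
from collections import defaultdict

LIKE_THRESHOLD = 5  # from the report: a rating of 5 or higher counts as "like"

# Hypothetical sample data; the real files and column names are not shown in the report.
RATINGS_CSV = """user_id,isbn,rating
1,111,8
1,222,3
2,111,5
2,333,9
"""
BOOKS_CSV = """isbn,title,category
111,"Gone, Baby, Gone",MNT
222,A Cookbook,NOF
"""


def load(text):
    """Parse CSV text; quoted fields may contain commas (e.g. book titles)."""
    return list(csv.DictReader(io.StringIO(text)))


def user_level_likes(ratings, books):
    """Join ratings to categories by ISBN, then pivot to one row per user."""
    category_by_isbn = {b["isbn"]: b["category"] for b in books}
    likes = defaultdict(dict)
    for r in ratings:
        category = category_by_isbn.get(r["isbn"])
        if category is None:
            continue  # no metadata match for this ISBN: drop the rating
        liked = int(int(r["rating"]) >= LIKE_THRESHOLD)
        column = category + "_rat"
        # Assumption: a category is "liked" if any book in it was liked.
        previous = likes[r["user_id"]].get(column, 0)
        likes[r["user_id"]][column] = max(previous, liked)
    return dict(likes)


table = user_level_likes(load(RATINGS_CSV), load(BOOKS_CSV))
print(table)
```

Each key of `table` is a user ID and each value holds that user's like / dislike dummies by category, which is the shape the association-rule step needs.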

Accordingly, a like / does-not-like dummy variable was defined for every category and individual. Finally, we binned the continuous variable age into 4 age bins.

Step 6: The final data contained 48,134 ratings by 9998 unique users across the US. Some data snapshots are presented in Appendix 2. The final form in which the data was obtained is presented in Figure 2 below:

Figure 2: Final Data Format (, Age not shown)

Findings from the application of data mining models

Since our final data is entirely categorical, we applied the following techniques:

- Association Rules, given the binary matrix form of the data and using the like dummies created for each category. This was to capture category affinity at an individual level without using location and age information.
- A Naïve Bayes classification for each category / genre, using location and binned age as predictors.

An assumption we made to simplify the analysis is that individual title preferences are indicative of genre / category preferences.

Association Rules

Table 1: Association Rules

Rule #  Conf. %  Antecedent (a)    Consequent (c)  Support(a)  Support(c)  Support(a U c)  Lift Ratio
1       56.99    BM_rat, LNF_rat   BC_rat          479         1587        273             3.586978
2       56.66    MNT_rat, NOF_rat  BC_rat          383         1587        217             3.565846
3       50.53    CB_rat, LNF_rat   BC_rat          469         1587        237             3.180366
4       73.2     BC_rat, ROM_rat   MNT_rat         306         3126        224             2.338924
5       90.04    HOR_rat, LNF_rat  MNT_rat         241         3940        217             2.282575
6       67.69    SFF_rat, ROM_rat  MNT_rat         588         3126        398             2.162695

Category codes:

BM   Biography & Memoirs
LNF  Literature and Fiction
NOF  Non Fiction
CB   Childrens Books
BC   Book Club
MNT  Mystery and Thriller
ROM  Romance
RNS  Religion and Spirituality
SFF  Science Fiction & Fantasy
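The confidence and lift columns of Table 1 follow from the support counts, as the short Python sketch below illustrates for rule 1. The function names are hypothetical, and `N_TRANSACTIONS = 9988` is an assumption: the report does not state how many user rows the rule miner saw, and 9,988 is simply the value that reproduces the reported lift figures.

```python
def confidence(support_a_and_c, support_a):
    """Share of users matching the antecedent who also match the consequent."""
    return support_a_and_c / support_a


def lift(support_a_and_c, support_a, support_c, n_transactions):
    """Confidence divided by the baseline rate of the consequent."""
    baseline = support_c / n_transactions
    return confidence(support_a_and_c, support_a) / baseline


# Assumption: number of user rows seen by the rule miner (not stated in the report).
N_TRANSACTIONS = 9988

# Rule 1 from Table 1: BM_rat, LNF_rat => BC_rat
rule1_conf = confidence(273, 479)
rule1_lift = lift(273, 479, 1587, N_TRANSACTIONS)
print(f"confidence = {rule1_conf:.2%}, lift = {rule1_lift:.3f}")
```

A lift above 1 means that liking the antecedent categories makes liking the consequent category more probable than it is for a randomly chosen user.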

We retained association rules with a lift over benchmark greater than 2. The confidence cutoff was set to 50%. Some of the recommended combinations are intuitive, for instance Horror and Mystery / Thrillers. Others are less intuitive, such as Biographies and Book Clubs, or Romances and Mystery / Thrillers.

Naïve Bayes Classification

We used a Naïve Bayes model with the like dummy for a category as the dependent variable and location / age as explanatory variables. Results for two categories are presented below (Table 2). Number_1 to Number_10 are the dummies for the 10 federal regions; class 1 is "like" and class 0 is "does not like".

Table 2: Naïve Bayes Results for Two Categories

Mystery and Thrillers

Input Variable  Value  Prob (Class 1)  Prob (Class 0)
Number_1        0      0.982613573     0.975173421
Number_1        1      0.017386427     0.024826579
Number_2        0      0.9326977       0.950711939
Number_2        1      0.067302299     0.049288061
Number_3        0      0.913628716     0.939028843
Number_3        1      0.086371284     0.060971157
Number_4        0      0.895120583     0.908360716
Number_4        1      0.104879417     0.091639284
Number_5        0      0.615255188     0.546184739
Number_5        1      0.384744812     0.453815261
Number_6        0      0.936623668     0.921625898
Number_6        1      0.063376332     0.078374102
Number_7        0      0.94615816      0.950468541
Number_7        1      0.05384184      0.049531459
Number_8        0      0.971396523     0.977120604
Number_8        1      0.028603477     0.022879396
Number_9        0      0.851374089     0.886698308
Number_9        1      0.148625911     0.113301692
Number_10       0      0.9551318       0.944626993
Number_10       1      0.0448682       0.055373007
Binned_Age      1      0.206954571     0.31118413
Binned_Age      2      0.198541784     0.282828283
Binned_Age      3      0.212563096     0.173420957
Binned_Age      4      0.38194055      0.23256663

Travel

Input Variable  Value  Prob (Class 1)  Prob (Class 0)
Number_1        0      0.985915493     0.976432672
Number_1        1      0.014084507     0.023567328
Number_2        0      0.971830986     0.947326015
Number_2        1      0.028169014     0.052673985
Number_3        0      0.830985915     0.935240205
Number_3        1      0.169014084     0.064759795
Number_4        0      0.929577465     0.905831403
Number_4        1      0.070422535     0.094168597
Number_5        0      0.52112676      0.558767247
Number_5        1      0.478873239     0.441232753
Number_6        0      0.943661972     0.924161547
Number_6        1      0.056338028     0.075838453
Number_7        0      0.957746479     0.949642461
Number_7        1      0.042253521     0.050357539
Number_8        0      1               0.975929097
Number_8        1      0               0.024070903
Number_9        0      0.873239436     0.880451204
Number_9        1      0.126760563     0.119548796
Number_10       0      0.985915493     0.946218149
Number_10       1      0.014084507     0.053781851
Binned_Age      1      0.112676056     0.293886595
Binned_Age      2      0.338028169     0.267297814
Binned_Age      3      0.267605634     0.179776413
Binned_Age      4      0.281690141     0.259039178

Age Intervals

Bin value  From  To   #records
1          13    27   8196
2          28    34   7343
3          35    45   7790
4          46    103  7642

Our dependent variables are the like dummies for Mystery and Thrillers and for Travel. For Mystery and Thrillers, Age Bin 4 (older people) has a higher probability of liking the genre. For Travel books, Age Bin 2 (28 to 34) shows a higher probability of liking the genre.
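To show how the conditional probabilities in Table 2 turn into a prediction, the Python sketch below applies Bayes' rule using the age bin alone for Mystery and Thrillers. It is a simplified illustration, not the XLMiner model: the full Naïve Bayes classifier would multiply in the region-dummy probabilities in the same way, and `prior_like` (the overall share of users who like the genre) is a hypothetical input, since the report does not list class priors. The function names are likewise illustrative.

```python
# Age bins from the Age Intervals table: (bin value, from, to)
AGE_BINS = [(1, 13, 27), (2, 28, 34), (3, 35, 45), (4, 46, 103)]

# P(Binned_Age = bin | class) for Mystery and Thrillers, copied from Table 2
P_AGE_GIVEN_LIKE = {1: 0.206954571, 2: 0.198541784, 3: 0.212563096, 4: 0.38194055}
P_AGE_GIVEN_NOT_LIKE = {1: 0.31118413, 2: 0.282828283, 3: 0.173420957, 4: 0.23256663}


def bin_age(age):
    """Map an age in years to its bin value."""
    for label, low, high in AGE_BINS:
        if low <= age <= high:
            return label
    raise ValueError(f"age {age} is outside the binned range 13-103")


def posterior_like(age, prior_like):
    """P(like | age bin) by Bayes' rule; prior_like is a hypothetical class prior."""
    b = bin_age(age)
    like = prior_like * P_AGE_GIVEN_LIKE[b]
    not_like = (1.0 - prior_like) * P_AGE_GIVEN_NOT_LIKE[b]
    return like / (like + not_like)


# With an even prior, a 50-year-old (bin 4) is more likely than not to like the genre,
# while a 20-year-old (bin 1) is less likely than not.
print(posterior_like(50, prior_like=0.5), posterior_like(20, prior_like=0.5))
```

The ratio of the two bin-4 probabilities (about 0.38 against 0.23) is what drives the observation that older readers favour Mystery and Thrillers.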

Federal regions 5 and 9 have the maximum concentration of data; no other significant regional trends were observed. The Naïve Bayes rules deliver a significant lift over random selection (Figure 3).

Figure 3: Lift from Naïve Bayes Rules for Two Sample Categories

Conclusion / Applications

We propose a two-tier recommendation system for BookCrossing.com, based on the following:

- When a user registers, use the Naïve Bayes results to propose genre recommendations based on demographics.
- Once the user reveals their genre preferences via assigned ratings, use affinity rules to propose more refined recommendations.

Future improvements

An obvious extension of our analysis is to incorporate data from geographies other than the US. Further analysis could extend the recommendation system and increase its level of granularity: for example, moving it to the book title level, refining the age bins to gain deeper insights, and introducing more rating bins than the current like / dislike split.
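The two-tier logic proposed in the conclusion could be sketched in Python as follows. The rule list is taken from Table 1; everything else is a hypothetical illustration, including the `recommend` function, the `demographic_scores` argument (assumed to hold per-genre like probabilities from the Naïve Bayes step) and the choice to fall back to the single best demographic genre when no rule fires.

```python
# (antecedent categories, consequent category) pairs from Table 1
RULES = [
    (frozenset({"BM", "LNF"}), "BC"),
    (frozenset({"MNT", "NOF"}), "BC"),
    (frozenset({"CB", "LNF"}), "BC"),
    (frozenset({"BC", "ROM"}), "MNT"),
    (frozenset({"HOR", "LNF"}), "MNT"),
    (frozenset({"SFF", "ROM"}), "MNT"),
]


def recommend(liked_categories, demographic_scores):
    """Return recommended genre codes for one user.

    liked_categories: genres the user has already rated as liked (may be empty).
    demographic_scores: hypothetical dict of genre -> P(like | region, age).
    """
    liked = set(liked_categories)
    hits = []
    for antecedent, consequent in RULES:
        # Tier 2: a rule fires when the user likes every antecedent genre.
        if antecedent <= liked and consequent not in liked and consequent not in hits:
            hits.append(consequent)
    if hits:
        return hits
    # Tier 1: no ratings yet, or no rule fired, so use demographics only.
    ranked = sorted(demographic_scores, key=demographic_scores.get, reverse=True)
    return ranked[:1]


scores = {"MNT": 0.6, "TRV": 0.2}  # hypothetical Naïve Bayes outputs
print(recommend(set(), scores))           # new user: demographic tier
print(recommend({"BM", "LNF"}, scores))   # existing user: affinity tier
```

Keeping the demographic tier as a fallback means every user receives some recommendation, even when their ratings match none of the retained rules.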

Appendix 1: Data Transformation

Appendix 2