Data Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology Department of CS


Data Mining. Dr. Raed Ibraheem Hamed. University of Human Development, College of Science and Technology, Department of CS, 2016-2017

Road map
Common distance measures
The Euclidean distance between 2 variables
K-means clustering
How does the K-means clustering algorithm work? (Step 1, Step 2, Step 3, Step 4)
More examples of K-means clustering
Demonstration of PAM
Steps of PAM
Department of CS - DM - UHD

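Common distance measures: The distance measure determines how the similarity of two elements is calculated, and it influences the shape of the clusters. Common choices include:
1. The Euclidean distance (also called the 2-norm distance), given by $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$.
2. The Manhattan distance (also called the 1-norm distance), given by $d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$.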
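The two measures listed above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function names and the sample points `p` and `q` are hypothetical.

```python
import math

def euclidean(x, y):
    """Euclidean (2-norm) distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Manhattan (1-norm) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Hypothetical points, used only for illustration.
p, q = (1.0, 1.0), (4.0, 5.0)
print(euclidean(p, q))  # 5.0  (sqrt(3^2 + 4^2))
print(manhattan(p, q))  # 7.0  (3 + 4)
```

The same pair of points is farther apart under the Manhattan distance than under the Euclidean distance, which is one reason the choice of measure changes the shape of the resulting clusters.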

The Euclidean Distance between 2 variables
The distance between two variables X and Y, given three persons' scores on each, is $d(X, Y) = \sqrt{\sum_{i=1}^{3} (X_i - Y_i)^2}$, where $X_i$ and $Y_i$ are the scores of person i on the two variables.
(Source: Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004.)

K-means Clustering
Basic algorithm:
Step 0: Select K.
Step 1: Randomly select initial cluster seeds.
Step 2: Calculate the distance from each object to each cluster seed. What type of distance should we use? Squared Euclidean distance.
Step 3: Assign each object to the closest cluster.

K-means Clustering
Step 4: Compute the new centroid for each cluster.
Iterate:
  Calculate the distance from objects to cluster centroids.
  Assign objects to the closest cluster.
  Recalculate the new centroids.
Stop based on convergence criteria:
  No change in clusters.
  Maximum number of iterations reached.
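Steps 0 to 4 and the stopping criteria can be put together as a short Python sketch. It is an illustrative outline under stated assumptions rather than the lecturer's implementation: the names `k_means`, `squared_euclidean` and `mean_point` are hypothetical, the seeds are passed in explicitly so that the run is reproducible, and an empty cluster simply keeps its previous centroid.

```python
import random

def squared_euclidean(x, y):
    """Squared Euclidean distance, the measure named in Step 2."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, seeds, max_iter=100):
    """Return (labels, centroids) once clusters stop changing or max_iter is hit."""
    centroids = [tuple(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        # Steps 2-3: assign each object to the closest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda j: squared_euclidean(p, centroids[j]))
            for p in points
        ]
        # Convergence criterion: no object changed cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: recompute each centroid as the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = mean_point(members)
    return labels, centroids

# Hypothetical data; Step 0 picks K = 2 and Step 1 draws random seeds.
data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centroids = k_means(data, random.sample(data, 2))
print(labels, centroids)
```

Because the seeds are drawn at random, the cluster numbering can differ between runs, which is why no fixed output is shown here.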

How does the K-means clustering algorithm work?

A simple example showing the implementation of the k-means algorithm (using K = 2)

Step 1: Initialization: We randomly choose the following two centroids (k = 2) for the two clusters. In this case the two centroids are m1 = (1.0, 1.0) and m2 = (5.0, 7.0).

Step 2: Using these centroids, m1 = (1.0, 1.0) and m2 = (5.0, 7.0), we compute the Euclidean distance of each individual to each centroid, as shown in the table (columns: individual, centroid 1, centroid 2). Thus, we obtain two clusters containing individuals {1,2,3} and {4,5,6,7}.
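The assignment in this step can be checked with a short Python sketch. The data table did not survive in this transcript, so the seven coordinates below are an assumption: they were chosen because they are consistent with every cluster and centroid reported in this example. The helper name `euclidean` is also hypothetical.

```python
import math

# ASSUMPTION: the original data table is missing from the transcript; these
# seven points are consistent with the clusters and centroids in the slides.
individuals = {
    1: (1.0, 1.0), 2: (1.5, 2.0), 3: (3.0, 4.0), 4: (5.0, 7.0),
    5: (3.5, 5.0), 6: (4.5, 5.0), 7: (3.5, 4.5),
}
m1, m2 = (1.0, 1.0), (5.0, 7.0)  # initial centroids from Step 1

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

clusters = {1: [], 2: []}
for name, point in individuals.items():
    d1, d2 = euclidean(point, m1), euclidean(point, m2)
    # With the assumed data, individual 3 is equidistant from both centroids;
    # the tie is resolved in favour of cluster 1 to match the slide's result.
    nearest = 1 if d1 <= d2 else 2
    clusters[nearest].append(name)
    print(f"{name}: d(m1) = {d1:.2f}, d(m2) = {d2:.2f} -> cluster {nearest}")

print(clusters)  # {1: [1, 2, 3], 2: [4, 5, 6, 7]}
```

Under this assumed data, the placement of individual 3 depends only on the tie-breaking rule, which hints at why it changes cluster in the next iteration.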

Step 2 (continued): We now compute the new centroids as the mean of the members of each cluster: m1 = (1.83, 2.33) and m2 = (4.12, 5.38).
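The centroid update is a coordinate-wise mean over each cluster. The sketch below reproduces the two centroids above; the coordinates are again an assumption (the original table is missing from the transcript), and the helper name `centroid` is hypothetical.

```python
# ASSUMPTION: coordinates reconstructed to be consistent with the slides.
individuals = {
    1: (1.0, 1.0), 2: (1.5, 2.0), 3: (3.0, 4.0), 4: (5.0, 7.0),
    5: (3.5, 5.0), 6: (4.5, 5.0), 7: (3.5, 4.5),
}
clusters = {"m1": [1, 2, 3], "m2": [4, 5, 6, 7]}  # result of the assignment

def centroid(members, data):
    """Coordinate-wise mean of the listed individuals."""
    pts = [data[i] for i in members]
    return tuple(sum(coords) / len(pts) for coords in zip(*pts))

new_centroids = {
    name: centroid(members, individuals) for name, members in clusters.items()
}
for name, (x, y) in new_centroids.items():
    print(f"{name} = ({x:.3f}, {y:.3f})")
```

With this data the unrounded values are about (1.833, 2.333) and exactly (4.125, 5.375), which the slide reports to two decimals.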

Step 3: Using these centroids, m1 = (1.83, 2.33) and m2 = (4.12, 5.38), we again compute the Euclidean distance of each object, as shown in the table.

Step 3: Therefore, the new clusters are {1,2} and {3,4,5,6,7}. The next centroids are m1 = (1.25, 1.5) and m2 = (3.9, 5.1). Note: each time, the new centroids must be computed from the original data table.

Step 4: We compute the Euclidean distances again. The clusters obtained are {1,2} and {3,4,5,6,7}, so there is no change in the clusters. The algorithm therefore halts here, and the final result consists of two clusters, {1,2} and {3,4,5,6,7}.
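The whole worked example, from the initial centroids to the halt in Step 4, can be replayed with the sketch below. As before, the seven points are an assumption reconstructed to match the slides, and the names `run_example`, `sq_dist` and `centroid` are hypothetical. Every cluster stays non-empty in this run, so empty clusters are not handled.

```python
def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def centroid(pts):
    return tuple(sum(coords) / len(pts) for coords in zip(*pts))

def run_example(points, centroids):
    """Iterate assignment and update; return one (labels, centroids) per pass."""
    labels, history = None, []
    while True:
        # Assignment: ties go to the lower-numbered cluster.
        new_labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # no change in clusters: halt
            return history
        labels = new_labels
        centroids = [
            centroid([p for p, lab in zip(points, labels) if lab == j])
            for j in range(len(centroids))
        ]
        history.append((labels, centroids))

# ASSUMPTION: individuals 1..7, reconstructed to be consistent with the slides.
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0),
          (3.5, 5.0), (4.5, 5.0), (3.5, 4.5)]
history = run_example(points, [(1.0, 1.0), (5.0, 7.0)])
for labels, cents in history:
    print(labels, [tuple(round(c, 2) for c in m) for m in cents])
```

The run records two passes: the first gives clusters {1,2,3} and {4,5,6,7}, and the second gives {1,2} and {3,4,5,6,7} with centroids (1.25, 1.5) and (3.9, 5.1). A third assignment changes nothing, so the loop stops.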

PLOT

Step 0: Use K = 2. Suppose A and C are randomly selected as the initial means.

Step 1.1: The clusters obtained are {A,B} and {C,D,E}. (Plots)

Step 2.1: Therefore, the new clusters are {A,B,C} and {D,E}. (Plots)

Step 3: The algorithm has converged: recalculating distances and reassigning cases to clusters results in no change. This is the final solution.

Demonstration of PAM
Cluster the following data set of ten objects into two clusters, i.e. k = 2. (Figure: distribution of the data.)

Step 1:
1. Initialize k centers.
2. Let us assume x2 and x8 are selected as medoids, so the centers are c1 = (3,4) and c2 = (7,4).
3. Calculate the distance to each center so as to associate each data object with its nearest medoid. Cost is calculated using the Manhattan distance (Minkowski metric with r = 1). Costs to the nearest medoid are shown in bold in the table.

Step 1: (Figure: clusters after step 1. Table: cost (distance) to c2.)

Step 1: The clusters then become:
Cluster 1 = {(3,4), (2,6), (3,8), (4,7)}
Cluster 2 = {(7,4), (6,2), (6,4), (7,3), (8,5), (7,6)}
Since the points (2,6), (3,8) and (4,7) are closer to c1, they form one cluster, while the remaining points form another. The total cost involved is 20.

Step 1: The cost between any two points is found using the formula $\text{cost}(x, c) = \sum_{i=1}^{d} |x_i - c_i|$, where x is any data object, c is the medoid, and d is the dimension of the object, which in this case is 2. The total cost is the sum of the costs of each data object from the medoid of its cluster, so here: total cost = (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2) = 20.
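The assignment and total cost of Step 1 can be verified with the following Python sketch. The ten points are the ones listed in the two clusters above; their ordering is an assumption apart from x2 = (3,4), x7 = (7,3) and x8 = (7,4), which the slides state. The function names are hypothetical.

```python
def manhattan(x, c):
    """Cost between a data object x and a medoid c (Manhattan distance)."""
    return sum(abs(a - b) for a, b in zip(x, c))

def assign_and_cost(points, medoids):
    """Attach each object to its nearest medoid and sum the costs."""
    clusters = {m: [] for m in medoids}
    total = 0
    for p in points:
        nearest = min(medoids, key=lambda m: manhattan(p, m))
        clusters[nearest].append(p)
        total += manhattan(p, nearest)
    return clusters, total

# Ten objects taken from the clusters in the slides; ordering partly assumed.
points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
          (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]
medoids = [(3, 4), (7, 4)]  # c1 = x2, c2 = x8

clusters, total = assign_and_cost(points, medoids)
for medoid, members in clusters.items():
    print(medoid, members)
print(total)  # 20
```

Each medoid contributes a cost of 0 for itself, so the total of 20 comes entirely from the eight non-medoid objects (11 for cluster 1 and 9 for cluster 2).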

Step 2: Select one of the nonmedoids, O. Let us assume O = (7,3), i.e. x7. So now the medoids are c1 = (3,4) and O = (7,3). If c1 and O are the new medoids, calculate the total cost involved using the formula from step 1. (Figure: clusters after step 2.)

Step 2: (Table: cost (distance) to c1 and cost (distance) to c2.)

Step 2: With c1 and O as medoids, the total cost rises from 20 to 22, so moving to O would be a bad idea and the previous choice is kept. The other nonmedoids are tried in the same way. The algorithm terminates when no swap lowers the total cost, i.e. when there is no change in the medoids.
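The swap test in Step 2 amounts to comparing the total cost before and after replacing one medoid. The sketch below evaluates only the single swap discussed here (c2 replaced by O = (7,3)); the point ordering is partly assumed, and the function names are hypothetical.

```python
def manhattan(x, c):
    return sum(abs(a - b) for a, b in zip(x, c))

def total_cost(points, medoids):
    """Sum over all objects of the distance to the nearest medoid."""
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

# Ten objects taken from the clusters in the slides; ordering partly assumed.
points = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
          (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]

current = [(3, 4), (7, 4)]    # c1 and c2
candidate = [(3, 4), (7, 3)]  # c1 and O (x7 replaces c2)

cost_before = total_cost(points, current)
cost_after = total_cost(points, candidate)
swap_cost = cost_after - cost_before
print(cost_before, cost_after, swap_cost)  # 20 22 2

# A swap is accepted only if it lowers the total cost.
accept_swap = swap_cost < 0
print(accept_swap)  # False
```

A positive swap cost means the candidate configuration is worse, so this particular swap is rejected and the medoids stay at c1 and c2.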
