Discovery of frequent episodes in event sequences

Discovery of frequent episodes in event sequences Andres Kauts, Kait Kasak University of Tartu 2009 MTAT.03.249 Combinatorial Data Mining Algorithms

What is sequential data mining?
Sequential data mining is a branch of data mining that deals with datasets in which each event has a time of occurrence.

Sequential data mining: where to use it?
Log analysis
Security (intrusion detection)
Analysing financial events (stock markets)
Genetics (DNA sequences)
Document collections
Time-based shopping basket prediction
...or anything else that looks like this:

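Basics
Data consists of events in a sequence.
Given a set E of event types, an event is a pair (A, t), where A ∈ E is an event type and t is an integer, the (occurrence) time of the event.
An event sequence s on E is a triple (s, t_s, t_e), where s = ⟨(A_1, t_1), (A_2, t_2), ..., (A_n, t_n)⟩ is the ordered sequence of events and t_s and t_e are the start and end times of the sequence.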

Example
Example sequence: s = ⟨(E, 31), (T, 32), (F, 33), (A, 35), (B, 37), (C, 38), ..., (D, 67)⟩
Example window of size 5: (⟨(A, 35), (B, 37), (C, 38), (E, 39)⟩, 35, 40)
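A minimal sketch of the example above, assuming events are stored as (type, time) pairs and a window [start, start + width) excludes its end point. The name `window` is illustrative and not from the slides; only the events that the slide spells out are used.

```python
from typing import List, Tuple

Event = Tuple[str, int]  # (event type, occurrence time)

def window(events: List[Event], start: int, width: int) -> List[Event]:
    """Return the events whose time t satisfies start <= t < start + width."""
    return [(a, t) for (a, t) in events if start <= t < start + width]

# Events named on the slide; the elided part of the sequence is not invented.
events = [("E", 31), ("T", 32), ("F", 33), ("A", 35), ("B", 37), ("C", 38), ("E", 39)]
w = window(events, 35, 5)
print(w)
```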

Episodes
[Figure: examples of a serial, a parallel and a complex episode built from event types A, B, C, E and F.]
An episode is a collection of events with a predefined order of appearance. Episodes are, in concept, similar to itemsets. The main difference is that the items (events) they consist of must appear within a certain time frame (window) and may have to appear in a particular order.

Algorithms
At first there was Apriori... but sequential data makes things a bit more complicated:
Multiple events can occur at the same time
The order of appearance matters

Apriori...
1. Gather all possible event types from the sequence.
2. Generate first-level candidates (episodes with one event).
3. Check which of the generated candidates are frequent.
4. Generate the next-level super-episodes of the frequent episodes found, as new candidates.
5. Wash, rinse and repeat...
6. Output rules.
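Step 4 can be sketched as follows. This is only an illustration under simplifying assumptions: episodes are parallel and are modelled as sets of distinct event types, and the function name `next_candidates` is hypothetical. A candidate of size k + 1 is kept only if every size-k subepisode is already known to be frequent.

```python
from itertools import combinations
from typing import FrozenSet, Set

def next_candidates(frequent: Set[FrozenSet[str]], k: int) -> Set[FrozenSet[str]]:
    """Build size-(k+1) candidates from the frequent episodes of size k."""
    if not frequent:
        return set()
    types = sorted(set().union(*frequent))
    candidates = set()
    for combo in combinations(types, k + 1):
        # Apriori pruning: all size-k subepisodes must be frequent.
        if all(frozenset(sub) in frequent for sub in combinations(combo, k)):
            candidates.add(frozenset(combo))
    return candidates

# Hypothetical level-2 result: AB, AC, BC and AD are frequent.
level2 = {frozenset("AB"), frozenset("AC"), frozenset("BC"), frozenset("AD")}
level3 = next_candidates(level2, 2)
print(level3)
```

Only {A, B, C} survives here, because every other triple has at least one infrequent pair.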

Algorithms
Two basic algorithms for finding frequent episodes:
WINEPI - sliding window approach
MINEPI - minimal occurrences approach

Winepi
Candidate episodes are generated.
A window is slid through the event sequence.
Occurrences of episodes are counted in every window.
Higher-level episode candidates are generated from the frequent episodes found.
Input: window size and minimal frequency.
Output: frequent episodes in the defined windows.

Winepi
A frequency threshold min_fr is used.
Episode α is frequent if fr(α, s, win) ≥ min_fr, i.e. if the frequency of α within the data sequence s with window width win reaches the minimum frequency threshold.
F(s, win, min_fr): the collection of frequent episodes in s with respect to win and min_fr.
The Apriori trick holds: if an episode α is frequent in an event sequence s, then all subepisodes of α are frequent as well.
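A naive sketch of the frequency computation, assuming that fr(α, s, win) is the fraction of sliding windows in which α occurs, that time is integer-valued, and that α is a parallel episode of distinct event types. The function name and the toy data are hypothetical.

```python
from typing import Iterable, List, Tuple

Event = Tuple[str, int]  # (event type, occurrence time)

def winepi_frequency(events: List[Event], episode: Iterable[str],
                     t_start: int, t_end: int, win: int) -> float:
    """Fraction of windows containing every event type of a parallel episode."""
    needed = set(episode)
    # Windows [w, w + win) that overlap the sequence [t_start, t_end):
    # w = t_start - win + 1, ..., t_end - 1.
    starts = range(t_start - win + 1, t_end)
    hits = 0
    for w in starts:
        present = {a for a, t in events if w <= t < w + win}
        if needed <= present:
            hits += 1
    return hits / len(starts)

events = [("A", 1), ("B", 2), ("A", 4)]  # hypothetical toy data
fr_a = winepi_frequency(events, {"A"}, 1, 5, 2)
fr_ab = winepi_frequency(events, {"A", "B"}, 1, 5, 2)
print(fr_a, fr_ab)
```

The subepisode {A} is at least as frequent as {A, B}, which is the property the Apriori trick relies on. This version rescans every window, so it is far slower than the incremental counting used by the actual algorithm.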

Winepi
Parallel episodes:
For each candidate α, maintain a counter α.event_count: how many events of α are present in the window.
When α.event_count becomes equal to |α|, indicating that α is entirely included in the window, save the starting time of the window in α.inwindow.
When α.event_count decreases again, increase the field α.freq_count by the number of windows in which α remained entirely in the window.

Winepi
Serial and complex episodes: use state automata.
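For a serial episode, the automaton idea can be sketched with a single integer state: the number of episode events matched so far. This is a simplified illustration for one window, assuming all events have distinct times; the name `occurs_serial` is hypothetical.

```python
from typing import List, Sequence, Tuple

Event = Tuple[str, int]  # (event type, occurrence time)

def occurs_serial(window_events: List[Event], episode: Sequence[str]) -> bool:
    """Check whether the event types of `episode` appear in this order."""
    if not episode:
        return True
    state = 0  # index of the next episode event the automaton waits for
    for event_type, _ in sorted(window_events, key=lambda e: e[1]):
        if event_type == episode[state]:
            state += 1
            if state == len(episode):
                return True  # accepting state reached
    return False

w = [("A", 1), ("C", 2), ("B", 3)]  # hypothetical window contents
print(occurs_serial(w, ["A", "B"]), occurs_serial(w, ["B", "A"]))
```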

Winepi
The window width is 40 seconds (the last point is excluded). The first windows start before the sequence and the last windows end after it.
[Figure: event sequence D, C, A, B, D, A, B, C on a time axis from 0 to 90.]
* Animation idea and sample data taken from: http://www.cs.helsinki.fi/u/ronkaine/dm/luentomateriaali/dami-011031.ppt

Winepi
Strengths: intuitive; not too heavy on memory.
Weaknesses: slow with larger frequent episodes.
Some problems:

Minepi
Candidate episodes are generated.
Minimal occurrences of each candidate episode are counted.
The frequency of each episode is computed from the minimal occurrences found.
Higher-level episode candidates are formed by joining the frequent episodes of the current level.
A maximum window width may be used.

Minepi
Formally, given an episode α and an event sequence s, the interval [t_s, t_e] is a minimal occurrence of α in s if:
α occurs in the window corresponding to the interval, and
α does not occur in any proper subinterval.
The set of minimal occurrences of an episode α in a given event sequence is denoted by mo(α):
mo(α) = { [t_s, t_e] | [t_s, t_e] is a minimal occurrence of α }

Minepi
Example: the parallel episode β consisting of event types A and B has three minimal occurrences in s: {[30,40], [40,60], [60,70]}; α has one occurrence in s: {[60,80]}.
[Figure: episodes β and α, and the event sequence D, C, A, B, D, A, B, C on a time axis from 0 to 90.]
* Excerpt taken from: http://www.cs.helsinki.fi/u/ronkaine/dm/luentomateriaali/dami-011031.ppt
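The minimal occurrences of β can be reproduced with a brute-force sketch. The event times below are an assumption: the figure is lost, so the events D, C, A, B, D, A, B, C are placed at times 10, 20, ..., 80, which is consistent with the intervals listed on the slide. Intervals are treated as closed, and the function names are hypothetical.

```python
from typing import Iterable, List, Tuple

Event = Tuple[str, int]  # (event type, occurrence time)

def occurs_parallel(events: List[Event], types: Iterable[str], ts: int, te: int) -> bool:
    """True if every event type occurs somewhere in the closed interval [ts, te]."""
    present = {a for a, t in events if ts <= t <= te}
    return set(types) <= present

def minimal_occurrences(events: List[Event], types: Iterable[str]) -> List[Tuple[int, int]]:
    """All intervals that contain the episode but have no proper subinterval doing so."""
    types = set(types)
    times = sorted({t for _, t in events})
    occ = [(ts, te) for i, ts in enumerate(times) for te in times[i:]
           if occurs_parallel(events, types, ts, te)]

    def has_proper_sub(iv: Tuple[int, int]) -> bool:
        ts, te = iv
        return any((s, e) != iv and ts <= s and e <= te for s, e in occ)

    return [iv for iv in occ if not has_proper_sub(iv)]

# Assumed placement of the slide's events on the time axis.
s = [("D", 10), ("C", 20), ("A", 30), ("B", 40),
     ("D", 50), ("A", 60), ("B", 70), ("C", 80)]
mo_beta = minimal_occurrences(s, {"A", "B"})
print(mo_beta)
```

Enumerating every pair of event times is quadratic and only meant to make the definition concrete; the actual algorithm builds minimal occurrences of larger episodes from those of their subepisodes.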

Minepi
Example 2: the parallel episode β consisting of event types A and B has four minimal occurrences in s: mo(β) = {[35, 38), [46, 48), [47, 58), [57, 60)}. The minimal occurrences of the partially ordered episode γ are [35, 39), [46, 51) and [57, 62).
[Figure: episodes β and γ.]

Minepi
An episode rule has the form β[win_1] ⇒ α[win_2], where β and α are episodes: if episode β has a minimal occurrence at interval [t_s, t_e] with t_e − t_s ≤ win_1, then episode α occurs at interval [t_s, t'_e] for some t'_e such that t'_e − t_s ≤ win_2.
The confidence of the rule β[win_1] ⇒ α[win_2] is |mo(α)| / |mo(β)|, where
|mo(β)| is the number of minimal occurrences of β such that t_e − t_s ≤ win_1, and
|mo(α)| is the number of such occurrences for which there is also an occurrence of α within the interval [t_s, t_s + win_2].
The frequency of the rule β[win_1] ⇒ α[win_2] is |mo(α)|.
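A small sketch of the confidence and frequency computation, assuming the minimal occurrences of both episodes are already available as closed intervals and that α occurs within [t_s, t_s + win_2] exactly when one of its minimal occurrences lies inside that interval. The function name and the chosen window widths are hypothetical; the intervals are the ones from the first Minepi example.

```python
from typing import List, Tuple

Interval = Tuple[int, int]

def rule_confidence(mo_beta: List[Interval], mo_alpha: List[Interval],
                    win1: int, win2: int) -> Tuple[float, int]:
    """Return (confidence, frequency) of the rule beta[win1] => alpha[win2]."""
    # Minimal occurrences of beta that are short enough.
    base = [(ts, te) for ts, te in mo_beta if te - ts <= win1]
    # Those that are followed by alpha within [ts, ts + win2].
    hits = [(ts, te) for ts, te in base
            if any(ts <= us and ue <= ts + win2 for us, ue in mo_alpha)]
    confidence = len(hits) / len(base) if base else 0.0
    return confidence, len(hits)

mo_beta = [(30, 40), (40, 60), (60, 70)]  # from the example slide
mo_alpha = [(60, 80)]                     # from the example slide
result = rule_confidence(mo_beta, mo_alpha, win1=10, win2=20)
print(result)
```

With win_1 = 10, two occurrences of β qualify ([30,40] and [60,70]), and only the second is followed by α within 20 time units, so the confidence is 0.5 and the frequency is 1.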

Minepi
Strengths: good performance with bigger episodes; more natural episode rules, as there can be several time limits in one rule, e.g. "If A and B occur within 15 seconds, then C follows within 30 seconds".
Weaknesses: memory hog.

Unbounded Episodes
Both Minepi and Winepi work with a fixed window width. But what if we are more interested in the closeness of elements than in a fixed window width? To overcome this problem, unbounded episodes are introduced: they define a maximal time t between any two consecutive events, but no window width.
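The gap constraint can be sketched for a serial episode as follows, reading the maximal time as a limit on the gap between consecutive matched events (an assumption) and assuming all events have distinct times. The function name is hypothetical. A small backtracking search is used because a greedy left-to-right match can miss valid occurrences once gaps are limited.

```python
from typing import List, Optional, Sequence, Tuple

Event = Tuple[str, int]  # (event type, occurrence time)

def occurs_with_max_gap(events: List[Event], episode: Sequence[str], max_gap: int) -> bool:
    """True if the serial episode occurs with every consecutive gap <= max_gap."""
    ordered = sorted(events, key=lambda e: e[1])

    def extend(idx: int, pos: int, last_t: Optional[int]) -> bool:
        if pos == len(episode):
            return True
        for j in range(idx, len(ordered)):
            event_type, t = ordered[j]
            if last_t is not None and t - last_t > max_gap:
                break  # all later events are even further away
            if event_type == episode[pos] and extend(j + 1, pos + 1, t):
                return True
        return False

    return extend(0, 0, None)

events = [("A", 1), ("B", 3), ("C", 10)]  # hypothetical toy data
print(occurs_with_max_gap(events, "ABC", 2), occurs_with_max_gap(events, "ABC", 7))
```

Note that no overall window width appears anywhere: with a gap limit of 7 the occurrence spans 9 time units.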

Unbounded Episodes
Unbounded episodes are good when one is more interested in the closeness of elements than in the window width itself.
[Figure: two examples, 1) and 2), comparing element width with a large window.]

Win-Miner
A maximum window size constrains episode length. We might be more interested in variable-width episodes. Unbounded episodes can help, but they are an incomplete solution: they often stay open over window sizes that are too long, which reduces confidence. To overcome these problems, Win-Miner was introduced.

Win-Miner
Find frequent unbounded episodes. Then find the optimal window size by looking for the point at which increasing the window size decreases confidence.
Input: support threshold, confidence threshold, maximum gap between events, decrease threshold.

Win-Miner

Case study Mining episode rules in STULONG dataset Nicolas Meger, Claire Leschi, Noël Lucas and Christophe Rigotti

Case study: Stulong
The dataset is the result of a twenty-year-long study of risk factors related to atherosclerosis in a population of 1417 middle-aged men. The Win-Miner algorithm was used.

Case study: Stulong
First run: 6 results found. Each discovered rule expressed knowledge that was already well known, which gave confidence that the experiment was working correctly. Additionally, the window of importance for the rules was found.

Case study: Stulong
Example: if the patient has no hypercholesterolemia and he sometimes follows his diet, then the patient has no hypercholesterolemia within 40 months with a probability of 0.8; 40 months is the optimal window size for this rule. This rule is supported by 201 examples in the event sequence.

Case study: Stulong
Second run: 217 results found. Again, many expected results were found, along with some new ones and the time of importance for some known rules.

Fuzzy Frequent Episodes
Similar to MINEPI, but event occurrences are not limited to the values 0 or 1. Events have a probability of occurrence, and the value of a minimal occurrence of an episode is the product of the values of its events.
Episode β: E = 0.2, B = 0.7, F = 0.3
In this example the value of the minimal occurrence of episode β is 0.2 × 0.7 × 0.3 = 0.042.
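The arithmetic of the example can be checked with a few lines. The function name is hypothetical, and the sketch covers only the product rule stated above, not the full fuzzy mining procedure.

```python
from math import prod
from typing import Iterable

def fuzzy_occurrence_value(event_values: Iterable[float]) -> float:
    """Value of an occurrence as the product of its events' values."""
    return prod(event_values)

# Values of the three events of episode beta in the example.
value = fuzzy_occurrence_value([0.2, 0.7, 0.3])
print(round(value, 3))
```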

Fuzzy Frequent Episodes
Fuzzy frequent episodes are beneficial:
If event attributes can represent quantitative data.
If event attributes cannot be easily classified, for instance: how little hair must a subject have to be considered bald?

Fuzzy Frequent Episodes
Events mined: the number of different destination ports during the last 2 seconds.
Anomaly percentage = m/n × 100 %, where n = total events and m = events not represented in the training data.
minconfidence = 0.8, minsupport = 0.1, minoccurrence = 0.3 and window = 15 s
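The anomaly percentage formula can be sketched as below, assuming that "represented in the training data" can be modelled as membership in a set of known events. The function name and the sample data are hypothetical.

```python
from typing import Hashable, Sequence, Set

def anomaly_percentage(test_events: Sequence[Hashable], known: Set[Hashable]) -> float:
    """m / n * 100, with n = total events and m = events not seen in training."""
    n = len(test_events)
    if n == 0:
        return 0.0
    m = sum(1 for e in test_events if e not in known)
    return m / n * 100

pct = anomaly_percentage(["a", "b", "x", "a"], {"a", "b"})
print(pct)
```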

Fuzzy Frequent Episodes vs Vanilla Episodes
PN was divided into 3 fuzzy sets or, in the case of traditional episodes, 3 fixed intervals (LOW, MEDIUM, HIGH).
False positive rates on the same training data:

Performance
[Figure: Winepi (serial) and Minepi (serial).]

Case study Mining Frequent Episodes for Relating Financial Events and Stock Trends Anny Ng and Ada Wai-chee Fu

Case study
Using a dataset of financial news (775 days), the authors harvested interesting keywords from it ("telecommunication stocks rise", "Star TV-HK Telecom") and tried to find relations between news events and events in the stock market.

Case study - performance
