Yong Cao, Debprakash Patnaik, Sean Ponce, Jeremy Archuleta, Patrick Butler, Wu-chun Feng, and Naren Ramakrishnan


1 Yong Cao, Debprakash Patnaik, Sean Ponce, Jeremy Archuleta, Patrick Butler, Wu-chun Feng, and Naren Ramakrishnan Virginia Polytechnic Institute and State University

2 Reverse-engineer the brain (National Academy of Engineering Top 5 Grand Challenges)
[Figure: a neuron, labeled Dendrites (receiver), Axon (wires), Axon Terminal (transmitter), and Action Potentials (Spikes). Cited from Sciseek.com]
Question: How are the neurons connected?

3 Reverse-engineer the brain (National Academy of Engineering Top 5 Grand Challenges)
Multi-Electrode Array (MEA): neurons grown on an MEA chip
[Figure: spike train stream for neurons A, B, and C over time]

4 Reverse-engineer the brain (National Academy of Engineering Top 5 Grand Challenges)
Find Repeating Patterns
Infer Network Connectivity

5 Fast data mining of spike train stream on Graphics Processing Units (GPUs)
MEA chip: Multi-Electrode Array (MEA)
GPU chip: NVIDIA GTX280 Graphics Card

6 Fast data mining of spike train stream on Graphics Processing Units (GPUs)
Two key algorithmic strategies to address the scalability problem on the GPU:
- A hybrid mining approach
- A two-pass elimination approach

7 Event stream data: sequence of neurons firing, $(E_1, t_1), (E_2, t_2), \ldots, (E_n, t_n)$
[Figure: events of neurons A, B, C, D plotted over time. Annotations: an event of type A occurred at t = 6; an event of type D occurred at t = 5]
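
The event stream on this slide can be pictured as a time-ordered list of (event type, timestamp) pairs. The following is a minimal sketch under that assumption; the function names `build_stream` and `spike_trains` are illustrative, and only the A and D events come from the slide's annotations (the B and C events are made up).

```python
from collections import defaultdict

# Hypothetical recordings as (event type, timestamp) pairs. The A and D
# entries echo the slide's annotations; the B and C entries are made up.
raw_events = [("A", 6), ("D", 5), ("B", 2), ("C", 9)]


def build_stream(events):
    """Return the events ordered by timestamp: (E_1, t_1), ..., (E_n, t_n)."""
    return sorted(events, key=lambda event: event[1])


def spike_trains(stream):
    """Group timestamps by event type, giving one spike train per neuron."""
    trains = defaultdict(list)
    for event_type, t in stream:
        trains[event_type].append(t)
    return dict(trains)


stream = build_stream(raw_events)
trains = spike_trains(stream)
```

Keeping the stream as one sorted sequence matches the single-scan counting used later in the slides; the per-neuron view is only a convenience for inspection.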


8 Pattern or Episode
Inter-event constraint
Occurrences (non-overlapped)
[Figure: events of neurons A, B, C, D over time, with the episode occurrences marked]
The episode appears twice in the event stream.

9 Data mining problem: Find all possible episodes / patterns which occur more than X times in the event sequence.
Challenge: combinatorial explosion, a large number of episodes to count.
[Figure: example episodes by episode size/length over event types A, B, C, D, e.g. A; A B; A B C; A B C D]

10 Mining Algorithm (a level-wise procedure to control combinatorial explosion)
Generate an initial list of candidate size-1 episodes
Repeat until no more candidate episodes:
  Count: occurrences of size-m candidate episodes (computational bottleneck)
  Prune: retain only frequent episodes
  Candidate generation: size-(m+1) candidate episodes from size-m frequent episodes
Output all the frequent episodes
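
The level-wise loop above can be sketched as follows. This is a simplified illustration, not the authors' implementation: inter-event constraints are omitted, the counter is a plain serial state machine, and candidate generation uses an Apriori-style prefix/suffix join, which is an assumption (the slides do not spell out the join rule). The names `count_nonoverlapped`, `generate_candidates`, and `mine` are hypothetical.

```python
def count_nonoverlapped(episode, stream):
    """Count non-overlapped occurrences of a serial episode (no time constraints)."""
    count, position = 0, 0
    for event_type, _t in stream:
        if event_type == episode[position]:
            position += 1
            if position == len(episode):
                count += 1
                position = 0  # start looking for the next occurrence
    return count


def generate_candidates(frequent):
    """Join size-m episodes a, b into a size-(m+1) candidate when a[1:] == b[:-1]."""
    freq = set(frequent)
    return sorted(a + (b[-1],) for a in freq for b in freq if a[1:] == b[:-1])


def mine(stream, threshold):
    """Return {episode: count} for episodes occurring more than `threshold` times."""
    if threshold < 0:
        raise ValueError("threshold must be non-negative")
    candidates = [(e,) for e in sorted({e for e, _t in stream})]
    result = {}
    while candidates:
        # Count: the expensive step that the slides move to the GPU.
        counts = {ep: count_nonoverlapped(ep, stream) for ep in candidates}
        # Prune: keep only frequent episodes.
        frequent = [ep for ep in candidates if counts[ep] > threshold]
        result.update((ep, counts[ep]) for ep in frequent)
        # Candidate generation for the next level.
        candidates = generate_candidates(frequent)
    return result


# Made-up stream in which A -> B -> C repeats twice.
stream = [("A", 1), ("B", 2), ("C", 3), ("A", 4), ("B", 5), ("C", 6)]
frequent = mine(stream, threshold=1)
```

Pruning before generating the next level is what keeps the candidate set from growing with every possible combination of event types.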

11 Counting Algorithm (for one episode)
Episode: A -> B -> C -> D, with one acceptor per node: Accept_A(), Accept_B(), Accept_C(), Accept_D()
Event stream: A1, A2, B4, A5, C10, B12, C13, D17
[Figure: per-node lists of accepted event times: A: 1, 2; B: 4, 12; C: 10, 13; D: 17]
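
One way to read the per-node acceptor lists is as a list-based state machine: each node keeps the timestamps at which it was accepted, and a node accepts a new event only if some timestamp in the previous node's list satisfies the inter-event interval. The sketch below follows that reading. The function name `count_exact` is hypothetical, and the `(0, 10]` intervals are assumptions, since the constraint values for this slide did not survive extraction.

```python
def count_exact(types, intervals, stream):
    """Count non-overlapped occurrences of a serial episode.

    types:     event types of the episode nodes, e.g. ("A", "B", "C", "D")
    intervals: (low, high] bounds on the gap between consecutive nodes
    stream:    time-ordered list of (event type, timestamp) pairs
    """
    n = len(types)
    accepted = [[] for _ in range(n)]  # accepted[i]: timestamps accepted by node i
    count = 0
    for event_type, t in stream:
        # Visit nodes last to first so one event is not reused within a pass.
        for i in range(n - 1, -1, -1):
            if types[i] != event_type:
                continue
            if i > 0:
                low, high = intervals[i - 1]
                if not any(low < t - prev <= high for prev in accepted[i - 1]):
                    continue
            if i == n - 1:
                count += 1
                accepted = [[] for _ in range(n)]  # reset: occurrences must not overlap
                break
            accepted[i].append(t)
    return count


# Event stream from the slide; the interval bounds are hypothetical.
stream = [("A", 1), ("A", 2), ("B", 4), ("A", 5),
          ("C", 10), ("B", 12), ("C", 13), ("D", 17)]
episode = ("A", "B", "C", "D")
intervals = ((0, 10), (0, 10), (0, 10))
occurrences = count_exact(episode, intervals, stream)
```

The growing lists and the nested search over them are the kind of state that makes this routine heavy for a GPU kernel, which motivates the simpler counter introduced later.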

12 Find an efficient counting algorithm on the GPU to count the occurrences of N size-m episodes in an event stream.
Address the scalability problem on the GPU's massively parallel execution architecture.

13 One episode per GPU thread (PTPE)
- Each thread counts one episode
- Simple extension of serial counting
[Figure: N episodes mapped to N GPU threads; GPU multiprocessors (MP) with stream processors (SP) and shared memory (SM); the event stream resides in global memory]
Efficient when the number of episodes is larger than the number of GPU cores.

14 Problem: with too few episodes (threads), some GPU cores will be idle.
Solution: increase the level of parallelism.
Multiple Threads per Episode (MTPE)
[Figure: the event stream is split into M event segments; N episodes are counted by N x M GPU threads]

15 Problem with simple count merge.
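
The slide does not state the problem explicitly. One likely reading is that summing per-segment counts misses occurrences that straddle a segment boundary. The sketch below illustrates that reading only; the helper names are hypothetical, time constraints are omitted, and the stream is made up.

```python
def count_nonoverlapped(episode, stream):
    """Count non-overlapped occurrences of a serial episode (no time constraints)."""
    count, position = 0, 0
    for event_type, _t in stream:
        if event_type == episode[position]:
            position += 1
            if position == len(episode):
                count += 1
                position = 0
    return count


def naive_segment_count(episode, stream, num_segments):
    """Split the stream into segments, count each independently, and sum."""
    size = max(1, -(-len(stream) // num_segments))  # ceiling division
    segments = [stream[i:i + size] for i in range(0, len(stream), size)]
    return sum(count_nonoverlapped(episode, segment) for segment in segments)


# Made-up stream: A -> B -> C occurs twice, but never inside a single
# two-event segment.
stream = [("A", 1), ("B", 2), ("C", 3), ("A", 4), ("B", 5), ("C", 6)]
episode = ("A", "B", "C")

whole = count_nonoverlapped(episode, stream)         # counts over the full stream
merged = naive_segment_count(episode, stream, 3)     # per-segment counts, summed
```

Here `whole` and `merged` disagree, so a correct merge step has to account for state carried across segment boundaries rather than just adding counts.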


16 Choose the right algorithm with respect to the number of episodes N.
Define a switching threshold: the crossover point (CP).
If N < CP: use MTPE. Otherwise: use PTPE.
$CP = MP \times B_{MP} \times T_B \times f(\text{size})$
Here $MP \times B_{MP} \times T_B$ is the GPU computing capacity and $f(\text{size})$ is the performance penalty factor.
MP: number of multiprocessors
B_MP: blocks per multiprocessor
T_B: threads per block
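
The switching rule can be written as a few lines of arithmetic. In the sketch below, MP = 30 matches the GTX 280 described in the hardware slide, while the blocks per multiprocessor, threads per block, and the coefficients of the penalty factor are hypothetical values chosen only to make the example concrete. The penalty factor is assumed to have the form a/size + b, following the later crossover-estimation slide.

```python
def penalty_factor(size, a, b):
    """Performance penalty factor, assumed to have the form a / size + b."""
    return a / size + b


def crossover_point(mp, blocks_per_mp, threads_per_block, penalty):
    """CP = MP * B_MP * T_B * f(size)."""
    return mp * blocks_per_mp * threads_per_block * penalty


def choose_algorithm(num_episodes, cp):
    """Use MTPE below the crossover point, PTPE at or above it."""
    return "MTPE" if num_episodes < cp else "PTPE"


# MP = 30 as on a GTX 280; every other number here is hypothetical.
cp = crossover_point(
    mp=30,
    blocks_per_mp=8,
    threads_per_block=64,
    penalty=penalty_factor(size=4, a=1.0, b=0.25),
)
few = choose_algorithm(1_000, cp)
many = choose_algorithm(20_000, cp)
```

With few episodes the capacity would be left idle under PTPE, so the rule picks MTPE; once the episode count exceeds the crossover point, one thread per episode already saturates the device.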

17 Problem: Original counting algorithm is too complex for a GPU kernel function.
Episode: A -> B -> C -> D, with one acceptor per node: Accept_A(), Accept_B(), Accept_C(), Accept_D()
Event stream: A1, A2, B4, A5, C10, B12, C13, D17
[Figure: per-node lists of accepted event times: A: 1, 2; B: 4, 12; C: 10, 13; D: 17]

18 Problem: Original counting algorithm is too complex for a GPU kernel function.
- Large shared memory usage
- Large register file usage
- Large number of branching instructions
[Figure: acceptors Accept_A(), Accept_B(), Accept_C(), Accept_D() with per-node lists (A: 1, 2, 5; B: 4, 12; C: 10, 13; D: 17) mapped onto GPU multiprocessors (MP), stream processors (SP), shared memory (SM), and global memory]

19 Solution: PreElim algorithm
- Less constrained counting
- Simple kernel function
- Upper bound only
Episode: A -(,5]-> B -(,10]-> C -(,5]-> D, with acceptors Accept_A(), Accept_B(), Accept_C(), Accept_D()
Event stream: A1, A2, B4, A5, C10, B12, C13, D17
[Figure: acceptor state for each episode node during PreElim counting]

20 A simpler kernel function
                  Shared Memory        Register   Local Memory
PreElim           4 x Episode Size     13         0
Normal Counting   44 x Episode Size

21 Solution: Two-pass elimination approach
PASS 1: Less Constrained Counting
PASS 2: Normal Counting
[Figure: in pass 1, all episodes are mapped to threads over the event stream; in pass 2, the fewer remaining episodes are mapped to threads over the event stream]
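
The two passes can be sketched as a cheap filter followed by the full count. In this sketch, pass 1 keeps only the latest accepted timestamp per node and checks only the upper bound of each interval, which is one reading of "less constrained counting"; pass 2 is a list-based counter that checks both bounds. The function names, the lower bounds, and the candidate episodes are hypothetical; the event stream and the upper bounds 5, 10, 5 come from the slides.

```python
def count_relaxed(types, intervals, stream):
    """Pass 1: upper bounds only, one timestamp of state per node."""
    n = len(types)
    last = [None] * n
    count = 0
    for event_type, t in stream:
        for i in range(n - 1, -1, -1):
            if types[i] != event_type:
                continue
            if i > 0 and (last[i - 1] is None or t - last[i - 1] > intervals[i - 1][1]):
                continue
            if i == n - 1:
                count += 1
                last = [None] * n
                break
            last[i] = t
    return count


def count_exact(types, intervals, stream):
    """Pass 2: full (low, high] check against a list of timestamps per node."""
    n = len(types)
    accepted = [[] for _ in range(n)]
    count = 0
    for event_type, t in stream:
        for i in range(n - 1, -1, -1):
            if types[i] != event_type:
                continue
            if i > 0:
                low, high = intervals[i - 1]
                if not any(low < t - prev <= high for prev in accepted[i - 1]):
                    continue
            if i == n - 1:
                count += 1
                accepted = [[] for _ in range(n)]
                break
            accepted[i].append(t)
    return count


def two_pass(episodes, stream, min_count):
    """Eliminate with the relaxed count first, then confirm with the exact count."""
    survivors = [ep for ep in episodes if count_relaxed(*ep, stream) >= min_count]
    frequent = {}
    for ep in survivors:
        exact = count_exact(*ep, stream)
        if exact >= min_count:
            frequent[ep] = exact
    return survivors, frequent


stream = [("A", 1), ("A", 2), ("B", 4), ("A", 5),
          ("C", 10), ("B", 12), ("C", 13), ("D", 17)]
# Each episode is (types, intervals); lower bounds are hypothetical.
ep1 = (("A", "B", "C", "D"), ((0, 5), (0, 10), (0, 5)))
ep2 = (("A", "B", "C"), ((3, 5), (0, 10)))
ep3 = (("D", "A"), ((0, 5),))
survivors, frequent = two_pass([ep1, ep2, ep3], stream, min_count=1)
```

Because the relaxed check accepts every transition the exact check accepts, the first pass is intended to over-count rather than under-count, so an episode it rejects should not need the expensive second pass. In this example `ep3` is dropped in pass 1 and `ep2` only in pass 2.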

22 A simpler kernel function
Compile-time difference:
                  Shared Memory        Register   Local Memory
PreElim           4 x Episode Size     13         0
Normal Counting   44 x Episode Size
Run-time difference:
            Local Memory Load and Store   Divergent Branching
Two Pass    24,770,310                    12,258,590
Hybrid      210,773,785                   14,161,399

23 Hardware
Computer (custom-built)
- Intel Core2 2.33GHz
- 4GB memory
Graphics Card (Nvidia GTX 280 GPU)
- 240 cores (30 MPs x 8 cores), 1.3GHz
- 1GB global memory
- 16K shared memory for each MP

24 Datasets
Synthetic (Sym26): 60 seconds with 50,000 events
Real (culture growing for 5 weeks):
- Day 33: ( events)
- Day 34: ( events)
- Day 35: ( events)

25 PTPE vs MTPE
[Figure: PTPE and MTPE performance, with crossover points marked]

26 Performance of the Hybrid Approach
[Figure: time (ms) versus episode size for PTPE, MTPE, and Hybrid, with crossover points and the number of episodes marked; Sym26 dataset]

27 Crossover Point Estimation
$f(\text{size}) = \frac{a}{\text{size}} + b$ is a better fit.
A least-squares fit is performed.
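
Assuming the fitted form is f(size) = a/size + b, the model is linear in x = 1/size, so an ordinary least-squares line fit recovers a and b. The sketch below uses closed-form simple linear regression; the function name `fit_penalty` is hypothetical and the sample points are synthetic, generated from a = 2.0 and b = 0.5 rather than taken from any measurement.

```python
def fit_penalty(sizes, values):
    """Least-squares fit of values ~ a / size + b; returns (a, b)."""
    xs = [1.0 / s for s in sizes]  # substitute x = 1 / size to get a linear model
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct episode sizes")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, values))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Synthetic points generated from a = 2.0, b = 0.5 (not measured data).
sizes = [1, 2, 3, 4, 5]
values = [2.0 / s + 0.5 for s in sizes]
a, b = fit_penalty(sizes, values)
```

In practice the `values` would be the observed ratios between measured crossover points and the raw computing capacity at each episode size.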

28 Two-pass approach vs Hybrid approach
99.9% fewer episodes


29 Performance of the Two-pass approach
[Figure, two panels: time (ms) versus episode size for One Pass and Two Pass; number of episodes versus episode size for Total # and First Pass Cull]


30 Percentage of episodes eliminated by each pass
[Figure: percentage of episodes eliminated (91% to 100%) versus support, for First Pass and Second Pass; episode size = 4]

31 GPU vs CPU
GPU is always faster than CPU
- 5x to 15x speedup
Fair comparison
- Two-pass algorithm used
- Maximum threading for both

32 Massive parallelism is required for conquering the near-exponential search space
- GPUs are far more accessible than high-performance clusters
Frequent episode mining
- Not data parallel
- Redesigned algorithm
Framework for real-time and interactive analysis of spike train experimental data


33 A fast temporal data mining framework on GPUs
- Commoditized system
- Massively parallel execution architecture
Two programming strategies
- A hybrid approach: increase the level of parallelism (data segmentation + map-reduce)
- Two-pass elimination approach: decrease algorithm complexity (task decomposition)

34 Questions.


35 Parallel execution via pthreads
Optimized for CPU execution
- Minimize disk access
- Cache performance
Implements the two-pass approach
- PreElim: simpler, quicker state machine
- Full state machine: slower, but required to eliminate all unsupported episodes
[Figure labels: ... A B D E F Z G ...; A B C D E F G H; ACE, ACDE, AEF, EFG]

36 Level-wise: N-size frequent episodes => (N+1)-size candidates
[Figure: candidate episodes built over event types A, B, C, D]
