Yong Cao, Debprakash Patnaik, Sean Ponce, Jeremy Archuleta, Patrick Butler, Wu-chun Feng, and Naren Ramakrishnan


Yong Cao, Debprakash Patnaik, Sean Ponce, Jeremy Archuleta, Patrick Butler, Wu-chun Feng, and Naren Ramakrishnan Virginia Polytechnic Institute and State University

Reverse-engineer the brain (one of the National Academy of Engineering's Top 5 Grand Challenges)
[Figure: neuron anatomy -- dendrites (receiver), axon (wires), axon terminal (transmitter), action potentials (spikes); image cited from Sciseek.com]
Question: How are the neurons connected?

Reverse-engineer the brain
Neurons are grown on a Multi-Electrode Array (MEA) chip; each electrode records a spike train, and together they form a spike train stream.
[Figure: MEA chip and spike trains for electrodes A, B, C over time]

Reverse-engineer the brain
Goal: find repeating patterns in the spike train stream and use them to infer network connectivity.

Fast data mining of spike train streams on Graphics Processing Units (GPUs)
[Figure: Multi-Electrode Array (MEA) chip alongside an NVIDIA GTX 280 graphics card]

Fast data mining of spike train streams on GPUs
Two key algorithmic strategies address the scalability problem on the GPU:
- A hybrid mining approach
- A two-pass elimination approach

Event stream data: a sequence of neuron firings (E1, t1), (E2, t2), ..., (En, tn), where each Ei is an event type (the neuron that fired) and ti is its firing time.
[Figure: spike raster for neurons A-D over time; for example, an event of type D occurs at t = 5 and an event of type A at t = 6]

Pattern (episode): an ordered sequence of event types with inter-event constraints, e.g. A -> B -> C -> D. Occurrences are counted non-overlapped.
[Figure: spike raster for neurons A-D; the episode appears twice in the event stream]

Data mining problem: find all episodes (patterns) that occur more than X times in the event sequence.
Challenge -- combinatorial explosion: the number of candidate episodes to count grows rapidly with episode size.
[Figure: candidates by episode size, from size-1 episodes (A, B, ...) through size-2 and size-3 permutations up to size-4 episodes such as A B C D]

Mining algorithm (a level-wise procedure to control the combinatorial explosion):
1. Generate an initial list of candidate size-1 episodes.
2. Repeat until no candidate episodes remain:
   - Count: occurrences of the size-m candidate episodes (the computational bottleneck)
   - Prune: retain only the frequent episodes
   - Candidate generation: build size-(m+1) candidates from the size-m frequent episodes
3. Output all the frequent episodes.
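A minimal sketch of the level-wise loop, in serial Python with no inter-event time constraints and a simplified extension-based candidate generator (not the join used in the paper):

```python
def count_serial(event_stream, episode):
    """Non-overlapped count of a serial episode (no time constraints):
    a stand-in for the counting step, which is the bottleneck."""
    state = count = 0
    for event_type, _t in event_stream:
        if event_type == episode[state]:
            state += 1
            if state == len(episode):   # full occurrence seen
                count += 1
                state = 0               # restart for non-overlap
    return count

def mine_frequent_episodes(event_stream, event_types, support, max_size=4):
    """Level-wise mining: generate size-1 candidates, then repeatedly
    count, prune, and extend until no candidates remain."""
    candidates = [(e,) for e in event_types]
    frequent = []
    while candidates:
        counts = {ep: count_serial(event_stream, ep) for ep in candidates}  # Count
        level = [ep for ep in candidates if counts[ep] >= support]          # Prune
        frequent.extend(level)
        if not level or len(level[0]) >= max_size:
            break
        # Candidate generation (simplified: extend by one unused event type)
        candidates = [ep + (e,) for ep in level for e in event_types if e not in ep]
    return frequent
```

On a toy stream `[('A',1), ('B',2), ('A',3), ('B',4)]` with support 2, this returns the frequent episodes (A), (B), and (A, B).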

Counting algorithm (for one episode): a finite state machine with one accept state per event type (Accept_A(), Accept_B(), Accept_C(), Accept_D()) scans the event stream and tracks partial occurrences, possibly several at once.
[Figure: automaton for episode A -> B -> C -> D scanning the event stream A1 A2 B4 A5 C10 B12 C13 D17]
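The counting idea can be sketched as a single greedy automaton. Note this simplification drops the inter-event time constraints; those constraints are what force the full algorithm to track several partial occurrences simultaneously, which is what later makes the kernel heavy:

```python
def count_nonoverlapped(event_stream, episode):
    """Count non-overlapped occurrences of a serial episode with one
    automaton: accept the next expected event type, and restart after
    each complete occurrence."""
    state = 0   # index of the next event type the automaton accepts
    count = 0
    for event_type, _time in event_stream:
        if event_type == episode[state]:
            state += 1
            if state == len(episode):   # full occurrence seen
                count += 1
                state = 0               # restart for non-overlap
    return count

stream = [('A',1), ('A',2), ('B',4), ('A',5), ('C',10), ('B',12), ('C',13), ('D',17)]
count_nonoverlapped(stream, ('A','B','C','D'))  # -> 1 (A1, B4, C10, D17)
```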

Goal: find an efficient GPU counting algorithm for the occurrences of N size-m episodes in an event stream, and address the scalability problem on the GPU's massively parallel execution architecture.

One episode per GPU thread (PTPE): each thread counts one episode -- a simple extension of the serial counting algorithm. Efficient when the number of episodes is larger than the number of GPU cores.
[Figure: N episodes mapped to N GPU threads across multiprocessors, with the event stream in global memory]

When there are not enough episodes per thread, some GPU cores sit idle. Solution: increase the level of parallelism -- Multiple Threads Per Episode (MTPE): split the event stream into M segments and launch N x M threads, one per (episode, segment) pair.
[Figure: N episodes x M event segments mapped to N x M GPU threads]

Problem with a simple count merge: an occurrence that spans a segment boundary is split across two segments, so per-segment counts cannot simply be summed.

Hybrid approach: choose the right algorithm with respect to the number of episodes N by defining a switching threshold, the crossover point (CP): if N < CP, use MTPE; otherwise, use PTPE.
CP = MP x B_MP x T_B x f(size) -- the GPU computing capacity, where:
- MP: number of multiprocessors
- B_MP: blocks per multiprocessor
- T_B: threads per block
- f(size): performance penalty factor
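The selection rule follows directly from the formula. In this sketch, `mp=30` matches the GTX 280 used in the experiments, but the blocks-per-MP, threads-per-block, and penalty-factor defaults are illustrative placeholders, not the paper's fitted values:

```python
def choose_algorithm(num_episodes, episode_size,
                     mp=30, blocks_per_mp=1, threads_per_block=256,
                     f=lambda size: 1.0):
    """Pick the counting strategy for one mining level: PTPE when there
    are enough episodes to saturate the GPU, MTPE otherwise.
    f(size) is the performance penalty factor (fitted empirically)."""
    crossover = mp * blocks_per_mp * threads_per_block * f(episode_size)
    return 'MTPE' if num_episodes < crossover else 'PTPE'
```

With the placeholder defaults the crossover point is 30 * 1 * 256 = 7680 episodes, so small candidate sets route to MTPE and large ones to PTPE.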

Problem: the original counting algorithm is too complex for a GPU kernel function.
[Figure: automaton with accept states Accept_A() ... Accept_D() for episode A -> B -> C -> D, scanning the event stream A1 A2 B4 A5 C10 B12 C13 D17]

Problem: the original counting algorithm is too complex for a GPU kernel function:
- Large shared memory usage
- Large register file usage
- Large number of branching instructions
[Figure: accept states mapped onto streaming processors and multiprocessors, with the event stream in global memory]

Solution: the PreElim algorithm -- less constrained counting with a simple kernel function. It produces an upper bound only on each episode's count.
[Figure: relaxed automaton for episode A -(0,5]-> B -(0,10]-> C -(0,5]-> D scanning the event stream A1 A2 B4 A5 C10 B12 C13 D17]

A simpler kernel function:

Per-thread resource | PreElim          | Normal counting
Shared memory       | 4 x episode size | 44 x episode size
Registers           | 13               | 17
Local memory        | 0                | 80

Solution: the two-pass elimination approach.
Pass 1: less constrained counting (PreElim) over all candidate episodes, one thread per episode.
Pass 2: normal counting over the far fewer episodes that survive pass 1.
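The control flow of the two passes can be sketched as follows. Here `preelim_count` and `exact_count` are hypothetical stand-ins for the PreElim and normal counting kernels; the scheme is only correct because pass 1 is an upper bound on the exact count, so it can never eliminate a truly frequent episode:

```python
def two_pass_count(event_stream, episodes, support, preelim_count, exact_count):
    """Pass 1: cheap, less-constrained counting (an upper bound) eliminates
    most candidates; Pass 2: the expensive exact counter runs only on the
    survivors, and re-checks the support threshold."""
    # Pass 1: keep only episodes whose upper bound reaches the support
    survivors = [ep for ep in episodes
                 if preelim_count(event_stream, ep) >= support]
    # Pass 2: exact counting; the upper bound may still overestimate
    return {ep: c for ep in survivors
            if (c := exact_count(event_stream, ep)) >= support}
```

On the GPU both passes run as parallel kernels; the point of the decomposition is that the simple pass-1 kernel fits comfortably in registers and shared memory while the heavy pass-2 kernel runs over a tiny fraction of the candidates.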

A simpler kernel function.

Compile-time difference:
Per-thread resource | PreElim          | Normal counting
Shared memory       | 4 x episode size | 44 x episode size
Registers           | 13               | 17
Local memory        | 0                | 80

Run-time difference:
Approach | Local memory loads/stores | Divergent branching
Two-pass | 24,770,310                | 12,258,590
Hybrid   | 210,773,785               | 14,161,399

Hardware (custom-built computer):
- Intel Core 2 Quad @ 2.33 GHz, 4 GB memory
- Graphics card: NVIDIA GTX 280 GPU -- 240 cores (30 MPs x 8 cores) @ 1.3 GHz, 1 GB global memory, 16 KB shared memory per MP

Datasets:
- Synthetic (Sym26): 60 seconds with 50,000 events
- Real (a culture grown for 5 weeks):
  - Day 33: 2-1-33 (333,478 events)
  - Day 34: 2-1-34 (406,795 events)
  - Day 35: 2-1-35 (526,380 events)

PTPE vs. MTPE
[Figure: measured performance of PTPE and MTPE, with crossover points marked]

Performance of the hybrid approach
[Figure: running time (ms, 0-1200) of PTPE, MTPE, and Hybrid vs. episode size (1-7), with crossover points marked; Sym26 dataset, support = 100]

Crossover point estimation: a least-squares fit is performed, and f(size) = a/size + b is a better fit.
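The model f(size) = a/size + b is linear in x = 1/size, so the least-squares fit reduces to solving the normal equations for a straight line; the sketch below does this by hand so no external library is needed (the sample data in the test is synthetic, not the paper's measurements):

```python
def fit_penalty_factor(sizes, penalties):
    """Ordinary least-squares fit of f(size) = a/size + b via the
    substitution x = 1/size, which makes the model linear in x."""
    xs = [1.0 / s for s in sizes]
    n = len(xs)
    sx, sy = sum(xs), sum(penalties)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, penalties))
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # slope in x = 1/size
    b = (sy - a * sx) / n                           # intercept
    return a, b
```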

Two-pass approach vs. hybrid approach: the first pass leaves 99.9% fewer episodes for the expensive second pass.

Performance of the two-pass approach (2-1-35 dataset, support = 3150):

Episode size    | 1     | 2      | 3       | 4        | 5
One-pass (ms)   | 93.2  | 1839.8 | 16139.7 | 132752.6 | 7036.6
Two-pass (ms)   | 160.4 | 1716.6 | 12602.6 | 41581.7  | 1844.6
Total episodes  | 64    | 6210   | 33623   | 173408   | 6288
First-pass cull | 18    | 2677   | 21442   | 169360   | 6288

Percentage of episodes eliminated by each pass (2-1-35 dataset, episode size = 4)
[Figure: share of episodes eliminated by the first pass vs. the second pass, for support thresholds from 3000 to 4000 in steps of 50; y-axis 91%-100%]

GPU vs. CPU: the GPU is always faster than the CPU, with a 5x-15x speedup. The comparison is fair: the two-pass algorithm and maximum threading are used on both.

Conclusions: massive parallelism is required for conquering the near-exponential search space, and GPUs are far more accessible than high-performance clusters. Frequent episode mining is not naturally data parallel, so the algorithm had to be redesigned. The result is a framework for real-time, interactive analysis of spike train experimental data.

Summary: a fast temporal data mining framework on GPUs -- a commoditized system with a massively parallel execution architecture -- built on two programming strategies:
- A hybrid approach: increase the level of parallelism (data segmentation + map-reduce)
- A two-pass elimination approach: decrease algorithm complexity (task decomposition)

Questions.

Parallel execution via pthreads (CPU implementation): optimized for CPU execution -- minimized disk access, good cache performance. It implements the two-pass approach: PreElim uses a simpler, quicker state machine; the full state machine is slower but is required to eliminate all unsupported episodes.
[Figure: example event sequences and candidate episodes such as ACE, ACDE, AEF, EFG]

Level-wise candidate generation: size-N frequent episodes are combined into size-(N+1) candidate episodes.
[Figure: two overlapping size-N episodes over neurons A-D merging into one size-(N+1) candidate]
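The combination step can be sketched as an Apriori-style join: two frequent size-N episodes whose trailing and leading (N-1)-subepisodes coincide merge into one size-(N+1) candidate. A minimal sketch, representing an episode as a tuple of event types (episodes with repeated event types are not generated here):

```python
def generate_candidates(frequent):
    """Join size-N frequent episodes into size-(N+1) candidates:
    p and q combine when p's suffix p[1:] equals q's prefix q[:-1]."""
    out = set()
    for p in frequent:
        for q in frequent:
            if p != q and p[1:] == q[:-1]:
                out.add(p + q[-1:])   # extend p by q's last event type
    return sorted(out)
```

For size-1 episodes the overlap is empty, so all ordered pairs combine, reproducing the full set of size-2 permutations from the combinatorial-explosion slide.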