Generalized Pattern Matching Micro-Engine

1 Generalized Pattern Matching Micro-Engine
Yuanwei Fang*, Raihan Rasool, Dilip Vasudevan*, Andrew A. Chien*
University of Chicago * Argonne National Laboratory King Faisal University

2 Big Data Applications
Deep Packet Inspection
Bioinformatics (DNA Alignment)
JSON/XML Parsing
Signal Triggering
6/24/2014 UNIVERSITY OF CHICAGO 2

3 Deep Packet Inspection
High-speed network: 100 Gb/s
Growing number of patterns: 6000 Snort rules
Speed requirement: > 75 Tera DFAops/s
Power budget: 200 W
Energy efficiency requirement: > 375 Gops/J
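
The speed and efficiency figures on this slide can be reproduced with a short back-of-envelope calculation. The sketch below assumes one 8-bit input character per DFA transition and that every rule scans every character; neither assumption is stated on the slide, and the variable names are illustrative.

```python
# Back-of-envelope check of the slide's requirement figures.
# Assumptions (not stated on the slide): one 8-bit character per DFA
# transition, and every rule's DFA scans every input character.
LINK_RATE_BITS_PER_S = 100e9   # 100 Gb/s network
NUM_RULES = 6000               # Snort rules
POWER_BUDGET_W = 200.0         # power budget

chars_per_s = LINK_RATE_BITS_PER_S / 8           # 12.5e9 characters/s
dfa_ops_per_s = chars_per_s * NUM_RULES          # transitions/s over all rules
ops_per_joule = dfa_ops_per_s / POWER_BUDGET_W   # ops/s per watt equals ops/J

print(f"{dfa_ops_per_s / 1e12:.0f} Tera DFAops/s")
print(f"{ops_per_joule / 1e9:.0f} Gops/J")
```

Under these assumptions the numbers come out at the slide's thresholds of 75 Tera DFAops/s and 375 Gops/J.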

4 Bioinformatics (DNA Alignment)
Bioinformatics database: millions of species
Speed requirement: > 1 Tera DFAops/s
Power budget: 200 W
Energy efficiency requirement: > 5 Gops/J
Genome size: 130G base pairs

5 Deterministic Finite Automata (DFA)
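
Since the figure for this slide did not survive, here is a minimal sketch of a table-driven DFA for readers who want a concrete picture. The pattern (any text containing "ab"), the table layout, and all names are illustrative assumptions, not taken from the talk.

```python
# A three-state DFA that accepts any text containing the substring "ab".
# State 0: nothing matched; state 1: just saw 'a'; state 2: saw "ab".
TABLE = {0: {"a": 1}, 1: {"a": 1, "b": 2}, 2: {}}
DEFAULT = {0: 0, 1: 0, 2: 2}   # next state for characters not listed above
ACCEPTING = {2}

def step(state, ch):
    """One DFA transition (one 'DFAop'): a single table lookup."""
    return TABLE[state].get(ch, DEFAULT[state])

def run_dfa(text, start=0):
    """Feed every character through the DFA and report acceptance."""
    state = start
    for ch in text:
        state = step(state, ch)
    return state in ACCEPTING

print(run_dfa("xxabyy"), run_dfa("ba"))
```

Each input character costs exactly one lookup, which is why throughput is counted in DFA transitions per second.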

6 Programmable Approaches
Intel Xeon E5-2600: 17G DFAops/second with 130 W, 0.13 Gops/J

7 Approach
Workload: M input characters (M DFA transitions); N DFA rules are applied to the M input characters
Goal: compute N x M transitions efficiently
Approach: parallelize DFA execution; fused instructions
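
To make the N x M workload concrete, the sketch below runs N rule DFAs over M input characters one transition at a time, which is the sequential baseline that parallel and fused execution aim to speed up. The two-character literal rules and the helper names are hypothetical, chosen only to keep the example short.

```python
def literal2_dfa(p):
    """Step function of a DFA accepting text that contains the 2-char literal p."""
    def step(state, ch):
        if state == 2:                    # already matched: stay accepting
            return 2
        if state == 1 and ch == p[1]:     # second character completes the match
            return 2
        return 1 if ch == p[0] else 0     # restart or hold on first character
    return step

def match_all(rules, text):
    """Sequential baseline: N rules x M characters = N*M transitions."""
    steps = [literal2_dfa(p) for p in rules]
    states = [0] * len(rules)
    transitions = 0
    for ch in text:                        # M input characters
        for i, step in enumerate(steps):   # N DFA rules
            states[i] = step(states[i], ch)
            transitions += 1
    return [s == 2 for s in states], transitions

matched, count = match_all(["ab", "cd", "zz"], "xxabcd")
print(matched, count)
```

The inner loop over rules carries no dependency between iterations, so it is the natural place to parallelize; the outer loop over characters is sequential per rule.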

8 What Is a Micro-Engine
The Generalized Pattern Matching Micro-Engine (GenPM) is one micro-engine of the 10x10 approach
Figure (block diagram) labels: Local Memory; an I-Cache for each of Basic RISC CPU, Microengine 2, Microengine 3, Microengine 4, GenPM, Microengine 6, Microengine 7, Microengine 8; Shared L1 Data Cache

9 GenPM Micro Architecture

10-13 Fused Instructions: Multi-Step
Figure (datapath diagram shown over four slides) labels: String buffer (a, b, c); Acc_Vec; Current State; ALU; address; Local Mem; Accept; ENB; Next State; CHECK (slide 13 only)
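
One plausible reading of the multi-step datapath is sketched below in software: a single call consumes every character in the string buffer, computes a local-memory address from the current state and the character, loads the next state, and records an accept bit per step in an accept vector. The flat table layout, the address formula, and the function names are assumptions for illustration; the slides do not give the instruction's exact semantics.

```python
ALPHABET = 256  # assumed: one table row of 256 entries per DFA state

def build_local_mem(num_states, edges, default=0):
    """Flatten a sparse transition map into one array ('local memory')."""
    mem = [default] * (num_states * ALPHABET)
    for (state, ch), nxt in edges.items():
        mem[state * ALPHABET + ord(ch)] = nxt
    return mem

def multi_step(local_mem, accepting, state, string_buffer):
    """Consume the whole buffer in one call; bit i of acc_vec marks an accept at step i."""
    acc_vec = 0
    for i, byte in enumerate(string_buffer):
        address = state * ALPHABET + byte   # address computed from state and character
        state = local_mem[address]          # next state loaded from local memory
        if state in accepting:
            acc_vec |= 1 << i
    return state, acc_vec

# DFA whose state 2 means "the input so far ends with 'ab'".
edges = {(0, "a"): 1, (1, "a"): 1, (1, "b"): 2, (2, "a"): 1}
mem = build_local_mem(3, edges)
print(multi_step(mem, {2}, 0, b"abc"))
```

Checking the accept vector once per call, rather than after every character, is what lets several transitions be fused into one operation.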

14 Parallel DFA: Vector Instruction
Figure label: SSE ADD

15 Parallel DFA: Vector Instruction
Figure labels: GMVSNEXT; DFAop (repeated)
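
The idea of one vector instruction performing several DFA transitions at once can be sketched as a lane-parallel step. The sketch assumes each lane holds a different rule's DFA in its own table (loosely analogous to a memory bank) and that all lanes see the same input character; the real GMVSNEXT semantics are not given on the slide, and `vector_next` is a hypothetical name.

```python
def vector_next(banks, states, ch):
    """One lane-parallel step: lane i looks up its own table with its own state."""
    return [bank[s].get(ch, 0) for bank, s in zip(banks, states)]

def run_lanes(banks, text):
    """Advance all lanes together, one vector step per input character."""
    states = [0] * len(banks)
    for ch in text:
        states = vector_next(banks, states, ch)
    return states

# Lane 0 tracks "ends with ab", lane 1 tracks "ends with bc"; state 2 accepts.
bank_ab = [{"a": 1}, {"a": 1, "b": 2}, {"a": 1}]
bank_bc = [{"b": 1}, {"b": 1, "c": 2}, {"b": 1}]
print(run_lanes([bank_ab, bank_bc], "abc"))
```

In hardware the list comprehension would correspond to independent lookups issued in the same cycle, one per lane.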

16 GenPM Code Example
Data movement
Multi-step parallel DFA execution
Find precise matching position
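
The original code listing is not preserved. As a rough software analogue of the last two steps, the sketch below processes the input in fixed-length blocks and, when a block reports an accept, recovers the exact character offset from the lowest set bit of the accept vector. The block structure, the bit layout, and all names are assumptions, not the GenPM instruction sequence.

```python
def run_block(delta, accepting, state, block):
    """Run one multi-step block; bit i of acc_vec marks an accept after character i."""
    acc_vec = 0
    for i, ch in enumerate(block):
        state = delta[state].get(ch, 0)
        if state in accepting:
            acc_vec |= 1 << i
    return state, acc_vec

def first_match_end(delta, accepting, text, step_len):
    """Return the index of the character that completes the first match, or None."""
    state = 0
    for start in range(0, len(text), step_len):
        state, acc_vec = run_block(delta, accepting, state, text[start:start + step_len])
        if acc_vec:
            # Lowest set bit gives the earliest accepting step inside the block.
            lowest = (acc_vec & -acc_vec).bit_length() - 1
            return start + lowest
    return None

# DFA whose state 2 means "the input so far ends with 'ab'".
delta = [{"a": 1}, {"a": 1, "b": 2}, {"a": 1}]
print(first_match_end(delta, {2}, "xxxxxabx", 4))
```

The coarse check (is the accept vector nonzero?) happens once per block; the precise position is computed only for the block that actually matched.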

17 Methodology
Design space: parallelism and step length
Baseline: 32-bit 6-stage in-order RISC; 4GB DDR3 DRAM; 32KB L1 I-cache, 24KB L1 D-cache, 512KB L2 (modeled on Intel Silverthorne)
GenPM: 1MB local memory (up to 64 banks); vector and fused instructions
Performance/power model: core: 32nm synthesis by Synopsys Processor Designer; memories: MARSSX86/CACTI 6 + DRAMSim2
Workload: 64 Snort rules from snapshot, 10KB random network dump

18 Performance
Figure: speedup versus RISC against step length, for GenPM_8way and GenPM_64way

19 Energy Efficiency
Figure: energy improvement versus RISC against step length, for GenPM_8way and GenPM_64way

20 Throughput/watt (absolute)
Figure: throughput per watt (Gops/J) against step length, for GenPM_8way and GenPM_64way
Scaled to a 75 W chip, GenPM delivers > 2.6 Tera DFAops/second
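
The scaled-chip claim implies an energy efficiency that can be compared with the Xeon figure from slide 6. The sketch below only rearranges numbers that appear in the slides; it treats "> 2.6 Tera DFAops/second" as exactly 2.6e12, so the results are lower bounds, and the variable names are illustrative.

```python
# Implied efficiency of the scaled GenPM chip versus the Xeon baseline.
GENPM_OPS_PER_S = 2.6e12     # "> 2.6 Tera DFAops/second", taken as a lower bound
GENPM_POWER_W = 75.0         # scaled chip power
XEON_OPS_PER_S = 17e9        # Intel Xeon E5-2600, slide 6
XEON_POWER_W = 130.0

genpm_gops_per_joule = GENPM_OPS_PER_S / GENPM_POWER_W / 1e9
xeon_gops_per_joule = XEON_OPS_PER_S / XEON_POWER_W / 1e9
ratio = genpm_gops_per_joule / xeon_gops_per_joule

print(f"GenPM: {genpm_gops_per_joule:.1f} Gops/J")
print(f"Xeon:  {xeon_gops_per_joule:.2f} Gops/J")
print(f"Ratio: {ratio:.0f}x")
```

The Xeon value reproduces the 0.13 Gops/J quoted on slide 6, which is a useful sanity check on the units.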

21 Energy Breakdown
Figure: share of total energy (0% to 100%) by component (LM, L1_I, L1_D, L2, DRAM, Core) for RISC, GenPM_8B_1S, GenPM_8B_8S, GenPM_8B_16S, GenPM_64B_1S, GenPM_64B_8S, GenPM_64B_16S
LM_max = 83%

22 General Comparison

23 Related Work
ASIC: [Brodie et al., ISCA 2006], [Titanic System RXP], [Cisco SCE]
FPGA: [Yang Xu et al., ANCS 2011], [T Song et al., INFOCOM 2008], [I Sourdis et al., VLSI 2008]
CPU: [Mytkowicz et al., ASPLOS 2014], [Intel HyperScan]
GPU: [Vasiliadis G et al., CCS 2011], [Lin CH et al., INFOCOM 2012]
SoC: [C Johnson et al., ISSCC 2010], [Cavium Octeon], [IBM PowerEN]

24 Summary
GenPM is a high-performance, energy-efficient accelerator for pattern matching workloads
ISA exploits parallelism and multi-step execution
Scaled to a 75 W chip, GenPM delivers > 2.6 Tera DFAops/second
GenPM approaches ASIC efficiency and integrates it into a programmable core

25 Future Work
DFA table compression
Scale up with multiple GenPM micro-engines
Explore more applications

26 Acknowledgements
Defense Advanced Research Projects Agency (DARPA)
Agilent Technologies (now Keysight Technologies)
Synopsys Academic program
Dr. Tung Hoang and members of the Large Scale Systems Group in the Department of Computer Science
