
An Efficient Regular Expressions Compression Algorithm From A New Perspective
Authors: Tingwen Liu, Yifu Yang, Yanbing Liu, Yong Sun, Li Guo
Publisher: INFOCOM, 2011 Proceedings IEEE
Presenter: 楊皓中
Date: 2014/05/28
Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan R.O.C.

Introduction
• The paper focuses on reducing the memory usage of composite DFAs by compressing transitions.
• Previous works
    • are based on the observation that many states share common outgoing transitions for the same character.
• In this paper
    • we try to obtain memory reduction by exploiting transition redundancy among states and the distribution of transitions inside states.
National Cheng Kung University CSIE Computer & Internet Architecture Lab

Definition of cluster
• Regular expression: .*A.{2}CD
• A DFA is a directed graph, so traversing it level by level (breadth-first traversal) yields a unique, deterministic trie-tree, provided that the son states are visited in increasing order of their transition labels.

Definition of cluster
• In the trie-tree
    • if state r has a transition to state s, we call r the father state of s and s a son state of r.
• A set of states is called a cluster if it consists of all the son states of a certain state.
• The set of son states of state 0 is {1}, so {1} is a cluster.
• By the same token, the state sets {2, 3}, {4, 5}, {6, 7}, and {8} through {13} are clusters too.
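The trie-tree and cluster definitions can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `bfs_clusters`, the table layout `delta[state][symbol]`, and the 5-state, 3-symbol DFA are all assumptions made here, and the toy DFA is not the one drawn on the slides.

```python
from collections import deque

def bfs_clusters(delta, start=0):
    """Breadth-first traversal of a DFA given as delta[state][symbol] -> next state.

    Returns (father, clusters): father[s] is the father state of s in the
    trie-tree, and clusters[r] lists all son states of r in label order.
    """
    father = {start: None}
    clusters = {}
    queue = deque([start])
    while queue:
        r = queue.popleft()
        for s in delta[r]:            # rows are indexed by symbol, so labels go small to large
            if s not in father:       # the first state that reaches s becomes its father
                father[s] = r
                clusters.setdefault(r, []).append(s)
                queue.append(s)
    return father, clusters

# Hypothetical 5-state DFA over a 3-symbol alphabet (not the DFA from the slides).
toy_delta = [
    [1, 2, 0],
    [3, 4, 0],
    [1, 2, 0],
    [3, 4, 2],
    [1, 2, 0],
]
father, clusters = bfs_clusters(toy_delta)
print(clusters)   # {0: [1, 2], 1: [3, 4]}
```

Here the son states of state 0 form the cluster [1, 2] and those of state 1 form [3, 4]; states without sons contribute no cluster.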

Transition characteristics
• For a state, its outgoing transitions lead to many distinct "next-states".
    • This observation alone does not help reduce the memory usage of DFAs, because the number of distinct "next-states" is more than 25 on average.
• We observe that the average number of distinct clusters is usually less than 5 (Column 3), which is much smaller than the number of distinct "next-states".
• Furthermore, the transitions of each state concentrate on the two clusters that receive the most transitions (Top-2 clusters for short).
    • According to Table II, these transitions account for more than 95% of all transitions.
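The three per-state quantities discussed above (distinct next-states, distinct clusters, and the share of transitions covered by the Top-2 clusters) could be measured roughly as follows. The helper `cluster_stats`, the labels in `cluster_of`, and the choice to treat the root state as its own cluster are assumptions for illustration; the numbers come from a toy row, not from the paper's rule sets.

```python
from collections import Counter

def cluster_stats(row, cluster_of):
    """Return (#distinct next-states, #distinct clusters, Top-2 share) for one state.

    row[symbol] is the next state; cluster_of maps a state to its cluster label.
    """
    counts = Counter(cluster_of[t] for t in row)
    top2 = sum(n for _, n in counts.most_common(2))   # transitions into the two largest clusters
    return len(set(row)), len(counts), top2 / len(row)

# Hypothetical cluster labels; the root state 0 is treated as a cluster of its own.
cluster_of = {0: "root", 1: "A", 2: "A", 3: "B", 4: "B"}
stats = cluster_stats([0, 1, 2, 3], cluster_of)
print(stats)   # (4, 3, 0.75)
```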

Description
• In DFAs, each state has 2^8 transitions, and each transition leads to one specific state; by the analysis above, it therefore leads to one specific cluster.
• We can divide all the transitions and store them in three different matrices T1, T2, and T3, as shown in Fig. 3:
    • the transitions that go to the cluster receiving the most transitions are put into matrix T1;
    • the transitions that go to the cluster receiving the second most are put into matrix T2;
    • the remaining transitions are put into matrix T3.
• For example, all the transitions of state 1 go to state 2 or 3, both of which belong to cluster {2, 3}; thus they are all put into T1, and T2 and T3 hold no transitions for state 1.
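A minimal sketch of this three-way split for a single state is given below. The function `split_row`, the use of `None` for an empty matrix entry, and the tie-breaking rule (clusters with equal counts are ranked by first appearance) are assumptions of this sketch; the slides do not say how ties are resolved.

```python
from collections import Counter

def split_row(row, cluster_of):
    """Split one state's transitions into its T1, T2 and T3 rows.

    row[symbol] is the next state; an entry is None where the transition
    belongs to one of the other two matrices.
    """
    counts = Counter(cluster_of[t] for t in row)
    ranked = [c for c, _ in counts.most_common()]     # clusters, most frequent first
    t1, t2, t3 = ([None] * len(row) for _ in range(3))
    for ch, target in enumerate(row):
        c = cluster_of[target]
        if c == ranked[0]:
            t1[ch] = target                            # largest cluster
        elif len(ranked) > 1 and c == ranked[1]:
            t2[ch] = target                            # second largest cluster
        else:
            t3[ch] = target                            # everything else
    return t1, t2, t3

# Hypothetical cluster labels for a toy DFA (not the DFA from the slides).
labels = {0: "root", 1: "A", 2: "A", 3: "B", 4: "B"}
rows = split_row([0, 1, 2, 3], labels)
print(rows)   # ([None, 1, 2, None], [0, None, None, None], [None, None, None, 3])
```

When every transition of a state lands in one cluster, as for state 1 in the example above, the T2 and T3 rows stay entirely empty.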

CLUSTER-BASED SPLITTING COMPRESSION ALGORITHM
• Cluster-based Splitting Compression Algorithm (CSCA)
• A. Compression Process
    • 1) Rearranging and Dividing
    • 2) Matrix Compressing
• B. Lookup
• C. Complexity Analysis

Compression Process: Rearranging and Dividing
• We need to rearrange the DFA level by level, following the order of the 2^8 characters.
• After this step, the trie-tree of the DFA has the following properties:
    • all son states of one state form a continuous sequence;
    • thus the state numbers in the same cluster are continuous.
• The benefit is that
    • the difference between the maximum and minimum state numbers in a cluster is less than 256.
• Divide the DFA matrix.
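The rearranging step can be pictured as renumbering the states in breadth-first discovery order. The sketch below assumes that every state is reachable from the start state and uses a hypothetical function name `renumber_bfs` and a toy 3-symbol DFA; it is an illustration, not the authors' procedure.

```python
from collections import deque

def renumber_bfs(delta, start=0):
    """Renumber DFA states in breadth-first discovery order.

    delta[state][symbol] is the next state. Assumes every state is reachable
    from start. Returns the transition table under the new numbering.
    """
    order = [start]
    seen = {start}
    queue = deque([start])
    while queue:
        r = queue.popleft()
        for t in delta[r]:             # symbols in increasing order
            if t not in seen:
                seen.add(t)
                order.append(t)        # sons of r receive consecutive new numbers
                queue.append(t)
    new_id = {old: new for new, old in enumerate(order)}
    return [[new_id[t] for t in delta[old]] for old in order]

# Hypothetical DFA whose original state numbers are scattered.
scrambled = [
    [3, 1, 0],
    [3, 1, 0],
    [3, 1, 0],
    [4, 2, 0],
    [3, 1, 0],
]
rearranged = renumber_bfs(scrambled)
print(rearranged)   # [[1, 2, 0], [3, 4, 0], [1, 2, 0], [1, 2, 0], [1, 2, 0]]
```

After renumbering, the sons of state 0 are 1 and 2 and the sons of state 1 are 3 and 4, so each cluster occupies a contiguous range of state numbers.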

Compression Process: Matrix Compressing
• Matrices T1 and T2 have the same structure:
    • in each of them, the transitions of a state all go to the same cluster, so we use the same algorithm to compress both.
• We use a different algorithm to compress T3.
    • Matrix T3 is a sparse matrix; this paper does not discuss how to compress it, but uses the classical sparse matrix compression algorithm Combinative row.

Compression Process: Matrix Compressing
• The main idea of compressing matrices T1 and T2 is to convert each matrix into an offset matrix, which produces many combinative rows, and then to merge those rows in order to reduce memory usage.
• As mentioned above, after rearranging the DFA level by level, the state numbers in the same cluster form a continuous sequence.
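One way to picture the offset-matrix idea is sketched below. It is a simplification under stated assumptions: each row is rewritten as a base state number plus per-character offsets, and only rows whose offsets are exactly identical are merged, whereas the paper's notion of combinative rows may allow more rows to be merged. The array names `base`, `equal`, and `R` follow the names that appear in the complexity analysis, but their roles here are an interpretation, and `None` marks a transition that is stored in another matrix.

```python
def compress_offsets(T):
    """Rewrite each row of T as base[s] + offsets and merge identical offset rows.

    Returns (base, equal, R): base[s] is the smallest target in row s,
    equal[s] is the index of the merged offset row in R.
    """
    base, equal, R = [], [], []
    row_index = {}                       # offset row -> its index in R
    for row in T:
        targets = [t for t in row if t is not None]
        b = min(targets) if targets else 0
        offsets = tuple(None if t is None else t - b for t in row)
        if offsets not in row_index:     # first time this offset pattern is seen
            row_index[offsets] = len(R)
            R.append(list(offsets))
        base.append(b)
        equal.append(row_index[offsets])
    return base, equal, R

def lookup_offset(base, equal, R, state, ch):
    """Recover the original entry T[state][ch] from the compressed form."""
    off = R[equal[state]][ch]
    return None if off is None else base[state] + off

# Hypothetical T1 matrix for three states over a 3-symbol alphabet.
T1 = [[1, 2, None], [3, 4, None], [2, 1, None]]
base, equal, R = compress_offsets(T1)
print(base, equal, R)   # [1, 3, 1] [0, 0, 1] [[0, 1, None], [1, 0, None]]
```

Because the state numbers inside a cluster are contiguous and span fewer than 256 values, every offset in `R` fits in one byte, and rows that point into different clusters can still share one offset row.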

Lookup
• The lookup function first needs to decide which part (T1, T2, or T3) holds the next state.
• Matrix T3 is compressed by a sparse matrix compression algorithm, which can determine by itself whether an element of T3 is effective or not; thus we can save a bitmap by accessing T3 before T2, as shown in Algorithm 2.
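Algorithm 2 is not reproduced in these slides, so the following is only a guess at the lookup order the bullets describe: one bitmap says whether a transition lives in T1, T3 is probed next because it can report a miss by itself, and T2 is the fallback. The function `next_state`, the bitmap `in_t1`, and the dictionary standing in for the compressed sparse matrix T3 are all hypothetical, and T1 and T2 are left uncompressed to keep the sketch short.

```python
def next_state(state, ch, in_t1, t1, t2, t3):
    """Find the next state by probing T1, then T3, then T2.

    in_t1[state][ch] is the single bitmap; t3 is a dict keyed by (state, ch),
    standing in for a sparse structure that can report "not present" itself.
    """
    if in_t1[state][ch]:
        return t1[state][ch]
    hit = t3.get((state, ch))      # T3 tells us on its own whether the entry is effective
    if hit is not None:
        return hit
    return t2[state][ch]           # no second bitmap is needed for T2

# Hypothetical single-state example over a 4-symbol alphabet.
in_t1 = [[False, True, True, False]]
t1 = [[None, 1, 2, None]]
t2 = [[0, None, None, None]]
t3 = {(0, 3): 3}
targets = [next_state(0, ch, in_t1, t1, t2, t3) for ch in range(4)]
print(targets)   # [0, 1, 2, 3]
```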

Complexity Analysis
• SCR (Spatial Compression Ratio) is used to evaluate the compression effect.
• R1 and R2 are offset matrices; each of their elements can be stored in one byte.
• base1 (base2) is an int-type array with n rows.
• The array equal1 (equal2) has n rows; each of its elements can be stored in log2 n1 (log2 n2) bits.
• For T3, we only measure the ratio of effective elements, denoted by r.

EXPERIMENT RESULTS

ORTHOGONAL TO PREVIOUS SCHEMES