Machine Learning Based Applications Classification Technologies All rights
Machine Learning Based Applications Classification Technologies Ó All rights reserved. No part of this publication and file may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior written permission of Professor Nen-Fu Huang (E-mail: nfhuang@cs. nthu. edu. tw). ML-1
Agenda n Introduction to Applications identification n Machine Learning Based Technologies n Machine Learning based Applications Identification Technologies ML-2
Application Traffic identification(or traffic classification) issues are focused in recently years since: The introducing of P 2 P application greatly impacts the network management task. l Port number is not the best and efficient discriminator to identify these prevalent traffics. l l How about string matching method? Accurate! But… 4 It cannot identify the encrypted traffic. 4 High cost on manually maintenance work for protocol signatures. 4 High cost to match string in very high speed network. 4 Privacy issue is under debating. ML-3
Application Traffic Identification n Identification at different levels Category level or Qo. S class (Bulk data transfer FTP&P 2 P, interactive, mail, web, streaming) l Protocol level (Kazza, e. Mule/e. Donkey, Bittorrent, MSN, FTP, POP 3, SMTP, HTTP, Skype, Winny, Share, …. ) l Behavior level (FTP control, FTP data, MSN file transfer, MSN message chatting, MSN voip, Skype Chatting, Skype voip, Skype File transfer, Skype Video conference, …) l Most existing researches focus on classification in protocol or category level. l ML-4
Application Traffic Identification n Utilization target l Offline based: 4 traffic trend analysis. l Online based: 4 traffic shaping, 4 traffic engineering, 4 security management. ML-5
How to resolve the problem? n Heuristics methods(2004~2005) l Based on some intrinsically different behavior, some rule can be constructed. 4 E. g. # dest IPs = # of dest ports the host is running P 2 P. l To differentiate P 2 P or non-P 2 P traffic. n Machine learning based techniques: (2004 ~ ) l To construct the “statistical signatures” for different categories/application protocols. l Most machine learning techniques are directly employed to construct traffic signature. ML-6
The Milestone of Researches on Application Traffic Identification n Before 2003: String matching and port number. n 2003~2005: l Heuristics l Machine learning method. n 2006~ : Machine learning method for real-time based traffic classification. l First k data packet sizes and direction of TCP connection. l Stage-based classification(Statistical data in each stage) ML-7
The Classes of Applied Machine Learning Algorithms n Supervised-Machine learning The model of traffic characteristics is constructed from the training instances with previously defined class label. n Unsupervised-Machine learning (Clustering) l The model of traffic characteristics is constructed from the training instances without previously defined class label. n However, all the existing training set employed by both include pre-classified label. l Because each cluster would contain several different classes/protocols. l ML-8
The Discriminators (Attributes) n The key issues for machine-learning based traffic identification are: l What are the most distinguishable characteristics (attributes/discriminators)? l How to remove the expensive cost on training? n Different discriminators: l From L 3/L 4 layer—packet inter-arrival time, total packet size, number of packets, …, etc. l Combination of L 3/L 4 attributes with different perspectives. e. g. upload/download size ratio. ML-9
Data Clustering Methods n Data Clustering or Partitioning Methods l Support Vector Machine (SVM) l Nearest Neighbor、 l Linear Discriminant Analysis (LDA) l K-means l Neural Networks、 l Decision Tree ML-10
Nearest Neighbor n To classify a data point x, let’s find the nearest neighbor! l The points with same property should be closely. l The class of the nearest neighbor will be assigned to the data point x. n K- Nearest Neighbor: l To find the k nearest neighbors and let them “vote”. ML-13
K-Nearest Neighbor X Stored training set patterns X input pattern for classification --- Euclidean distance measure to the nearest three patterns ML-14
K-Nearest Neighbor Algorithm Store all input data in the training set For each pattern in the test set Search for the K nearest patterns to the input pattern using a Euclidean distance measure For classification, compute the confidence for each class as Ci /K, (where Ci is the number of patterns among the K nearest patterns belonging to class i. ) The classification for the input pattern is the class with the highest confidence. ML-15
Linear Discriminant Analysis (LDA) n To find the good “projection” for original points. n Linear discriminant analysis finds a linear transformation ("discriminant function") of the two predictors, X and Y, that yields a new set of transformed values that provides a more accurate discrimination than either predictor alone: Transformed Target = C 1*X + C 2*Y 3 features More information: http: //www. dtreg. com/lda. htm http: //neural. cs. nthu. edu. tw/jang/books/dcpr/index. asp ML-16
Linear Discriminant Analysis 2 features More information: http: //www. dtreg. com/lda. htm http: //neural. cs. nthu. edu. tw/jang/books/dcpr/index. asp ML-17
LDA Evaluation Example n Attributes for this evaluation: the average packet size, flow duration, bytes per flow, packets per flow, and Root Mean Square (RMS) packet size. 18 ML-18
LDA Evaluation Example ML-19
K-means Clustering n For given number of clusters k, to iteratively find k centers of these k clusters and “partition” all the points into these k clusters until the nearest center does not change. n Each data point is expressed as a vector, and Euclidean distance is the most common distance computation function. ML-20
Decision trees n Decision trees are popular for pattern recognition because the models they produce are easier to understand. Root node A. Nodes of the tree C A B. Leaves (terminal nodes) of the tree A C. Branches (decision point) of the tree B B ML-21
Decision trees -Binary decision trees n Classification of an input vector is done by traversing the tree beginning at the root node, and ending the leaf. n Each node of the tree computes an inequality (ex. BMI<24, yes or no) based on a single input variable. n Each leaf is assigned to a particular class. BMI<24 No Yes Yes No ML-22
A High Accurate Machine-Learning Algorithm for Identifying Application Traffic in Early Stage Nen-Fu Huang+ , Gin-Yuan Jai+, and Han-Chieh Chao 1 +Department of Computer Science, National Tsing Hua University, Taiwan 1 National Ilan University, Taiwan IEEE ICC 2008, Information Sciences, 2013 (SCI)
Application Rounds-Example ML-24
Application Rounds-Concepts n Talk block Tij n Round Ii ML-25
Attributes-Interaction Round Discriminators Inter-arrival time between In and In+1 ith Interaction Response time between Ti. A and Ti. B of Ii Round Ii Number of total bytes/packets sent during Ii Data throughput of bytes/packets during Ii ML-26
Attributes-TALK block Discriminators Average inter-arrival time of packets of Tij TALK block Tij of Interaction Round Ii ( j = A or B ) Number of total bytes/packets sent during Tij Elapsed time of Tij Data throughput of bytes/packets during Tij ML-27
Attributes-L 7 features Discriminators Number of total bytes/packets sent by L 7 layer features initializer or listener during first k rounds of a TCP/UDP Sum of elapsed time from T 1 A to Tk. A flow Sum of elapsed time from T 1 B to Tk. B Data throughput of bytes/packets during first k rounds (individually initializer or listener) Data throughput of bytes/packets during first k rounds (sum of initializer and listener) ML-28
Attributes-L 4 features Discriminators Initializer’s and Listener’s Port number Which side sends the first data packet L 4 layer features of a (the flag defines the initialize, can be TCP/UDP flow client or server) L 4 protocol (TCP or UDP) ML-29
Experiment Architecture Traffic Dump (payload included) Protocol signature Result Flow Preprocessing Flow Sets Flow Sampling Random Split Average … … … Machine Learning Training Sets 10 Test Sets 1 … Training Sets 1 Result 10 Sample Set Test Sets 10 10 -fold cross validation ML-30
Traffic Traces n NIU Traffic: l 2007/01/30 13: 25 to 2007/01/31 13: 25 l Separated into two sub-traffic traces 4 NIU-1, NIU-2 n NTHU Traffic: l Captured at two PC rooms l Simulated user behaviors l CS 326: NTHU-1, 2008/11/18 15: 16 -17: 20 l CS 328: NTHU-2, 2008/11/18 15: 45 -17: 16 ML-31
Building the “Background Truth” n For evaluation purpose, the string and port matching methods are adopted to get the “real class” of each flow. l String pattern: 4 L 7 filter and some rules from testing and related works. l Port rule: 4 IANA port assignment: for traditional protocol 4 Specific server port number: MSN, Yahoo, etc. ML-32
Traffic Sampling n The proportion of sampled flows for each application l original ratio sampling: to reflect the original proportion l fixed-number sampling: to have a more balanced view for learning accuracy n The transport layer protocols considered for sampling l TCP+UDP, TCPonly, UDPonly n The number of first APPR rounds considered for traffic classification l R 1(first round), R 2(first two rounds), R 3, and R 4. ML-33
Overall Accuracy Several Machine Learning Algorithms ML-34
The Influence of Feature Selection Overall Accuracy ML-35
The Influence of Feature Selection Model Building Time (Training) ML-36
The Influence of Feature Selection Model Testing Time (Testing) ML-37
Overall Accuracy Comparison n The supervised machine learning l SVM for [36]: icc 07 -svm n The unsupervised machine learning l Kmeans for [5]: acm 5 pkt-kmeans l GMM for [6]: conext-proto-Dominant, conext- proto-Port n The semi-supervised machine learning l Kmeans for [17]: jeff. L 0 -kmeans, jeff. L 1 -kmeans ML-38
Overall Accuracy Compared with Other Related Works ML-39
Overall Accuracy Compared with Other Related Works ML-40
Conclusions n The proposed classifier achieved high accuracy for J 48, PART, and Bayesian network algorithms. n The combined selected attributes, including port number, transport layer protocol id, and size information, offers an acceptable test time and overall accuracy. ML-41
Conclusions n Machine learning based techniques to identify the Network Applications are more and more important. n Focus on real-time based, protocol level requirement of application traffic classification. n No existing common traffic traces provided for comparing the performance in the same base line. n Expensive training is still a problem. n Identifying encrypted traffic (e. g. Skype, Winny, Encrypted BT) is a new challenge. n Identifying detailed behaviors of encrypted traffic is even a big challenge. ML-42
- Slides: 42