Learning to apply machine learning to the KDD CUP 99 data set
by Security Dude
DFIR can easily be overwhelmed by the amount of data produced on the network. Security devices are data generators. Computer devices spit out logs. Applications hold our data. We discard data because there is so much of it to deal with. Big Data is here, but it's not going to help immediately because the analysis tools are in their infancy.
Security analysts need better tools to sift through the data to identify bad actors and attacks. It makes me laugh when people think that computers can do all the heavy lifting and people can sit around and drink coffee all day. They have this seductive vision of automation and they believe that technology will solve all their network security issues. IMHO:
Human analysts will always be needed to monitor the automated system, to identify new categories of attacks, and to analyze the more sophisticated attacks.
We will use one or more of the following techniques to help analysts identify attackers:
- Data summarization with statistics, including finding outliers
- Visualization: presenting a graphical summary of the data for analysis
- Clustering of the data into categories
- Association rule discovery: defining normal activity and enabling the discovery of anomalies
- Classification: predicting the category to which a particular record belongs
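As a taste of the first technique, here is a minimal sketch of summarizing one numeric field and flagging outliers by z-score. The byte counts are made-up values for illustration, and the threshold of 3 standard deviations is a common rule of thumb, not something tuned for KDD data.

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=3.0):
    """Return (index, value) pairs whose z-score magnitude exceeds threshold."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [(i, v) for i, v in enumerate(values)
            if abs(v - mu) / sigma > threshold]

# Hypothetical per-connection byte counts; the last one is deliberately extreme.
src_bytes = [181, 239, 235, 219, 217, 212, 159, 210,
             212, 210, 177, 222, 256, 241, 260, 54540]

print(zscore_outliers(src_bytes))
```

A single huge value inflates the standard deviation it is measured against, so on small samples a z-score can never get very large. A median-based score is usually a sturdier choice for real traffic.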
We (Corey and I) will use 21+ learned machines to label the records of the entire KDD train and test sets, which should give us 21+ predicted labels for each record. My calculation is as follows: 7+ learners, each trained 3 times on a different training set, gives at least 7 × 3 = 21 trained models. Some of the research suggests using a multi-expert classification system for the KDD CUP 99 dataset: http://nsl.cs.unb.ca/NSL-KDD/ Corey suggests that we try random forests (Wikipedia excerpt below):
Random forest (or random forests) is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by individual trees. The algorithm for inducing a random forest was developed by Leo Breiman and Adele Cutler, and “Random Forests” is their trademark. The term came from random decision forests that was first proposed by Tin Kam Ho of Bell Labs in 1995. The method combines Breiman’s “bagging” idea and the random selection of features, introduced independently by Ho and Amit and Geman in order to construct a collection of decision trees with controlled variation.
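To make the arithmetic and the voting idea concrete, here is a rough sketch. The learner and training-set names are placeholders, and the votes are invented for a single record. It only shows how 7 learners and 3 training sets yield 21 models, and how taking the mode of their labels mirrors what a random forest does with its trees.

```python
from collections import Counter
from itertools import product

# Placeholder names: 7 learner types and 3 training sets.
LEARNERS = [f"learner_{i}" for i in range(1, 8)]
TRAIN_SETS = ["train_a", "train_b", "train_c"]

# One trained model per (learner, training set) pair.
MODELS = list(product(LEARNERS, TRAIN_SETS))

def majority_label(predicted_labels):
    """Return the mode of the predicted labels (ties go to the label seen first)."""
    return Counter(predicted_labels).most_common(1)[0][0]

# Invented predictions for one record, one vote per model.
votes = ["smurf."] * 12 + ["normal."] * 6 + ["neptune."] * 3
assert len(votes) == len(MODELS)

print(len(MODELS), majority_label(votes))
```

A plain majority vote treats every model as equally trustworthy. Weighting votes by each model's validation accuracy is an obvious next step if some learners turn out to be much weaker than others.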
The simulated attacks fall into one of the following four categories:
- Denial of Service Attack (DoS): is an attack in which the attacker makes some computing or memory resource too busy or too full to handle legitimate requests, or denies legitimate users access to a machine.
- User to Root Attack: is a class of exploit in which the attacker starts out with access to a normal user account on the system (perhaps gained by sniffing passwords, a dictionary attack, or social engineering) and is able to exploit some vulnerability to gain root access to the system.
- Remote to Local Attack: occurs when an attacker who has the ability to send packets to a machine over a network but who does not have an account on that machine exploits some vulnerability to gain local access as a user of that machine.
- Probing Attack: is an attempt to gather information about a network of computers for the apparent purpose of circumventing its security controls.
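In the data itself each record carries a specific attack name rather than one of these four categories, so a lookup table is handy. The sketch below uses the grouping of the 22 training-set attack names as I understand it from the published task description; treat the mapping as an assumption and check it against the dataset documentation. The function name `categorize` is my own.

```python
# Assumed grouping of the training-set attack names into the four categories.
ATTACK_CATEGORY = {
    "dos": ["back", "land", "neptune", "pod", "smurf", "teardrop"],
    "u2r": ["buffer_overflow", "loadmodule", "perl", "rootkit"],
    "r2l": ["ftp_write", "guess_passwd", "imap", "multihop",
            "phf", "spy", "warezclient", "warezmaster"],
    "probe": ["ipsweep", "nmap", "portsweep", "satan"],
}

# Invert to a flat label -> category lookup.
LABEL_TO_CATEGORY = {
    label: category
    for category, labels in ATTACK_CATEGORY.items()
    for label in labels
}

def categorize(raw_label):
    """Map a raw label such as 'smurf.' to its category, or 'unknown'."""
    label = raw_label.strip().rstrip(".")  # labels in the files end with a dot
    if label == "normal":
        return "normal"
    return LABEL_TO_CATEGORY.get(label, "unknown")

print(categorize("smurf."), categorize("guess_passwd."), categorize("normal."))
```

The test set contains attack names that never appear in training, which is why unmapped labels come back as "unknown" instead of raising an error.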
KDD’99 features can be classified into three groups:
- Basic features
- Traffic features
- Content features
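For reference, here is how I would write the three groups down in code. The column names follow the names file shipped with the dataset, and the split (columns 1 to 9 basic, 10 to 22 content, 23 to 41 traffic) is my reading of the task description, so verify it before relying on it. `group_of` is a hypothetical helper.

```python
# Feature names in file column order, split into the three groups (assumed split).
FEATURE_GROUPS = {
    "basic": [
        "duration", "protocol_type", "service", "flag", "src_bytes",
        "dst_bytes", "land", "wrong_fragment", "urgent",
    ],
    "content": [
        "hot", "num_failed_logins", "logged_in", "num_compromised",
        "root_shell", "su_attempted", "num_root", "num_file_creations",
        "num_shells", "num_access_files", "num_outbound_cmds",
        "is_host_login", "is_guest_login",
    ],
    "traffic": [
        "count", "srv_count", "serror_rate", "srv_serror_rate",
        "rerror_rate", "srv_rerror_rate", "same_srv_rate", "diff_srv_rate",
        "srv_diff_host_rate", "dst_host_count", "dst_host_srv_count",
        "dst_host_same_srv_rate", "dst_host_diff_srv_rate",
        "dst_host_same_src_port_rate", "dst_host_srv_diff_host_rate",
        "dst_host_serror_rate", "dst_host_srv_serror_rate",
        "dst_host_rerror_rate", "dst_host_srv_rerror_rate",
    ],
}

# Flattened list, usable as column names when loading the CSV.
ALL_FEATURES = [name for names in FEATURE_GROUPS.values() for name in names]

def group_of(feature):
    """Return the group a feature belongs to, or raise KeyError."""
    for group, names in FEATURE_GROUPS.items():
        if feature in names:
            return group
    raise KeyError(feature)

print(len(ALL_FEATURES), group_of("src_bytes"), group_of("serror_rate"))
```

Keeping the groups separate makes it easy to train a learner on, say, only the basic features and compare it against one that sees everything.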
KDD Cup 1999. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
Loading KDD Cup 1999 data into a Python pandas DataFrame
Careful, as this script will DoS your system. Loading is intensive and reminds you of what a slow machine can feel like. I ran the load on my dual-SSD 2009 MacBook Pro with 8GB of RAM.
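One way to ease the memory pressure is to stream the file in chunks instead of reading it in one go. The following is a minimal sketch, not the script I ran: the function name, the generic column names `f0` to `f40`, and the chunk size are all my own choices, and it only tallies the label column to show the pattern.

```python
from collections import Counter

import pandas as pd

N_FEATURES = 41  # each record has 41 features followed by a label

def count_labels(source, chunksize=100_000):
    """Stream a headerless KDD-style CSV in chunks and tally the label column."""
    columns = [f"f{i}" for i in range(N_FEATURES)] + ["label"]
    totals = Counter()
    reader = pd.read_csv(source, header=None, names=columns, chunksize=chunksize)
    for chunk in reader:
        # Only one chunk is held in memory at a time.
        totals.update(chunk["label"].value_counts().to_dict())
    return totals
```

Calling `count_labels("kddcup.data_10_percent.gz")` on a local copy of the 10 percent subset (the path is an assumption about where you saved it) should work without unpacking first, since pandas infers gzip compression from the file extension.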
| | Supervised Training | Combined Unsupervised-Supervised Training | Unsupervised Training (Clustering) |
|---|---|---|---|
| | | Radial Basis Function (RBF); Learning Vector Quantizer (LVQ); Nearest Cluster Classifier; Fuzzy ARTMap Classifier | |
| | Gaussian Linear Discriminant; Binary Decision Tree; Support Vector Machine | Gaussian Mixture Classifier (Diagonal/Full Covariance, Tied/Per-Class Centers) | |
| Feature Selection Algorithms | Linear Discriminant (LDA) | | Principal Components (PCA) |
Building your own dataset
Tcpreplay is a suite of GPLv3 licensed tools written by Aaron Turner for UNIX (and Win32 under Cygwin) operating systems which gives you the ability to use previously captured traffic in libpcap format to test a variety of network devices. It allows you to classify traffic as client or server, rewrite Layer 2, 3 and 4 headers and finally replay the traffic back onto the network and through other devices such as switches, routers, firewalls, NIDS and IPS’s. Tcpreplay supports both single and dual NIC modes for testing both sniffing and inline devices.
Tcpreplay is used by numerous firewall, IDS, IPS and other networking vendors, enterprises, universities, labs and open source projects.
NSL-KDD Dataset http://iscx.ca/NSL-KDD/
Book: Data Mining: Practical Machine Learning Tools and Techniques
Book: Data Mining: Concepts and Techniques