Learning to apply machine learning to the KDD CUP 99 data set

by Security Dude

Problem Statement

DFIR can easily be overwhelmed by the amount of data produced on the network. Security devices are data generators, computers spit out logs, and applications hold our data. We discard data because there is so much of it to deal with. Big Data is here, but it's not going to help immediately because the analysis tools are in their infancy.

Security analysts need better tools to sift through the data to identify bad actors and attacks. It makes me laugh when people think that computers can do all the heavy lifting while they sit around and drink coffee all day. They have a seductive vision of automation and believe that technology will solve all their network security issues. IMHO:

Human analysts will always be needed to monitor the automated systems, identify new categories of attacks, and analyze the more sophisticated ones.

We will use one or more of the following techniques to help analysts identify attackers:

  • Data summarization with statistics, including finding outliers
  • Visualization: presenting a graphical summary of the data for analysis
  • Clustering of the data into categories
  • Association rule discovery: defining normal activity and enabling the discovery of anomalies
  • Classification: predicting the category to which a particular record belongs
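As a taste of the first technique, here is a minimal outlier-detection sketch with pandas. The numbers are made-up stand-ins for KDD'99 `src_bytes` values, and the z-score threshold of 2 is an assumption you would tune:

```python
import pandas as pd

# Hypothetical connection sizes; in practice this column comes from KDD'99.
df = pd.DataFrame({"src_bytes": [215, 162, 236, 233, 239, 486, 1337000]})

# Z-score: how many standard deviations each record sits from the mean.
z = (df["src_bytes"] - df["src_bytes"].mean()) / df["src_bytes"].std()

# Flag anything beyond 2 standard deviations (tunable assumption).
outliers = df[z.abs() > 2]
```

With these toy values, only the huge 1,337,000-byte record is flagged.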

Proposed Methodology

We (Corey and I) will use 21+ trained learners to label the records of the entire KDD train and test sets, which should give us 21+ predicted labels for each record. My math: 7+ learners, each trained 3 times on different training sets. Some of the research suggests using a multi-expert classification system for the KDD CUP 99 dataset (http://nsl.cs.unb.ca/NSL-KDD/). Corey suggests that we try Random forests (Wikipedia below):

Random forest (or random forests) is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by individual trees. The algorithm for inducing a random forest was developed by Leo Breiman and Adele Cutler, and “Random Forests” is their trademark. The term came from random decision forests that was first proposed by Tin Kam Ho of Bell Labs in 1995. The method combines Breiman’s “bagging” idea and the random selection of features, introduced independently by Ho and Amit and Geman in order to construct a collection of decision trees with controlled variation.
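A minimal random-forest sketch with scikit-learn, using randomly generated stand-in features; real KDD records would first need the categorical columns (protocol_type, service, flag) encoded as numbers:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# 200 fake records with 5 numeric features (stand-in for KDD features).
X = rng.random((200, 5))
# Toy binary "normal vs attack" label derived from the first feature.
y = (X[:, 0] > 0.5).astype(int)

# An ensemble of 100 decision trees; each tree votes, the mode wins.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
preds = clf.predict(X)
```

Each of the 21+ learners in the proposed methodology would produce a `preds` column like this one, and the columns could then be combined by voting.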

Dataset Overview

The simulated attacks fall in one of the following four categories:

  1. Denial of Service Attack (DoS): is an attack in which the attacker makes some computing or memory resource too busy or too full to handle legitimate requests, or denies legitimate users access to a machine.
  2. User to Root Attack: is a class of exploit in which the attacker starts out with access to a normal user account on the system (perhaps gained by sniffing passwords, a dictionary attack, or social engineering) and is able to exploit some vulnerability to gain root access to the system.
  3. Remote to Local Attack: occurs when an attacker who has the ability to send packets to a machine over a network but who does not have an account on that machine exploits some vulnerability to gain local access as a user of that machine.
  4. Probing Attack: is an attempt to gather information about a network of computers for the apparent purpose of circumventing its security controls.
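The raw KDD'99 label column holds specific attack names, so it helps to fold them into these four categories plus normal. A partial mapping sketch (the full training set has around 22 attack types; only the best-known ones are listed here):

```python
# Partial mapping from raw KDD'99 attack labels to the four categories.
ATTACK_CATEGORY = {
    "smurf": "dos", "neptune": "dos", "back": "dos",
    "teardrop": "dos", "pod": "dos", "land": "dos",
    "buffer_overflow": "u2r", "rootkit": "u2r", "loadmodule": "u2r", "perl": "u2r",
    "guess_passwd": "r2l", "ftp_write": "r2l", "imap": "r2l", "phf": "r2l",
    "multihop": "r2l", "warezmaster": "r2l", "warezclient": "r2l", "spy": "r2l",
    "ipsweep": "probe", "portsweep": "probe", "nmap": "probe", "satan": "probe",
    "normal": "normal",
}

def categorize(label: str) -> str:
    # Raw KDD labels carry a trailing "." (e.g. "smurf."), so strip it.
    return ATTACK_CATEGORY.get(label.rstrip("."), "unknown")
```

Anything not in the mapping (including the attack types that appear only in the test set) falls through as "unknown".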

Dataset Features

KDD’99 features can be classified into three groups:

  1. Basic features
  2. Traffic features
  3. Content features

KDD Cup 1999.  http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html

Loading KDD Cup 1999 data into a Python pandas DataFrame

Careful, as this script will DoS your system. Loading is intensive and reminds you of what a slow machine can feel like. I loaded it on my dual-SSD 2009 MacBook Pro with 8 GB of RAM.
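A minimal loading sketch. The 41 column names come from the dataset's kddcup.names file; the two inline sample rows below stand in for the real file, which you would pass as a path instead (e.g. the downloaded kddcup.data.gz, which pandas decompresses transparently):

```python
import io
import pandas as pd

# The 41 KDD'99 feature names (from kddcup.names), plus the attack label.
col_names = [
    "duration", "protocol_type", "service", "flag", "src_bytes", "dst_bytes",
    "land", "wrong_fragment", "urgent", "hot", "num_failed_logins", "logged_in",
    "num_compromised", "root_shell", "su_attempted", "num_root",
    "num_file_creations", "num_shells", "num_access_files", "num_outbound_cmds",
    "is_host_login", "is_guest_login", "count", "srv_count", "serror_rate",
    "srv_serror_rate", "rerror_rate", "srv_rerror_rate", "same_srv_rate",
    "diff_srv_rate", "srv_diff_host_rate", "dst_host_count",
    "dst_host_srv_count", "dst_host_same_srv_rate", "dst_host_diff_srv_rate",
    "dst_host_same_src_port_rate", "dst_host_srv_diff_host_rate",
    "dst_host_serror_rate", "dst_host_srv_serror_rate", "dst_host_rerror_rate",
    "dst_host_srv_rerror_rate", "label",
]

# Two sample records stand in for the real file here; in practice replace
# `sample` with the path to the downloaded data, e.g. "kddcup.data.gz".
sample = io.StringIO(
    "0,tcp,http,SF,215,45076,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,1,"
    "0.00,0.00,0.00,0.00,1.00,0.00,0.00,0,0,"
    "0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,normal.\n"
    "0,icmp,ecr_i,SF,1032,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,511,511,"
    "0.00,0.00,0.00,0.00,1.00,0.00,0.00,255,255,"
    "1.00,0.00,1.00,0.00,0.00,0.00,0.00,0.00,smurf.\n"
)
df = pd.read_csv(sample, names=col_names, header=None)
```

The full 10% subset has roughly 494,000 rows and the complete file close to 5 million, which is where the memory pain on an 8 GB machine comes from.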


Collection of Machine Learning Chart

Supervised Training
  • Gaussian Linear Discriminant
  • Gaussian Quadratic
  • K-Nearest Neighbor
  • Binary Decision Tree
  • Parzen Window
  • Naive Bayes
  • Support Vector Machine

Combined Unsupervised-Supervised Training
  • Back-Propagation (MLP)
  • Hypersphere Classifier
  • Radial Basis Function (RBF)
  • Incremental RBF
  • Learning Vector Quantizer (LVQ)
  • Nearest Cluster Classifier
  • Fuzzy ARTMap Classifier
  • Gaussian Mixture Classifier
      – Diagonal/Full Covariance
      – Tied/Per-Class Centers

Unsupervised Training (Clustering)
  • K-Means Clustering
  • EM Clustering
  • Leader Clustering
  • Random Clustering

Feature Selection Algorithms
  • Linear Discriminant (LDA)
  • Forward/Backward Search
  • Principal Components (PCA)
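To make the supervised/unsupervised distinction concrete, here is a toy sketch of one algorithm from each column using scikit-learn, on made-up two-blob data standing in for "normal" vs "attack" traffic:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Two well-separated 2-D blobs: 50 "normal" points near 0, 50 "attack" near 3.
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Supervised: K-Nearest Neighbor learns from the labeled records.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Unsupervised: K-Means groups the same records without seeing any labels.
km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)
```

The supervised model needs the label column up front; the clustering model discovers the two groups on its own, which is why clustering shows up in anomaly-detection work where labels are scarce.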

Building your own dataset

Tcpreplay is a suite of GPLv3-licensed tools written by Aaron Turner for UNIX (and Win32 under Cygwin) operating systems which gives you the ability to use previously captured traffic in libpcap format to test a variety of network devices. It allows you to classify traffic as client or server, rewrite Layer 2, 3 and 4 headers, and finally replay the traffic back onto the network and through other devices such as switches, routers, firewalls, NIDS and IPSs. Tcpreplay supports both single and dual NIC modes for testing both sniffing and inline devices.

Tcpreplay is used by numerous firewall, IDS, IPS and other networking vendors, enterprises, universities, labs and open source projects.

DARPA Dataset http://www.ll.mit.edu/mission/communications/cyber/CSTcorpora/ideval/data/

NSL-KDD Dataset http://iscx.ca/NSL-KDD/

Book: Data Mining and Machine Learning in Cybersecurity

Book: Data Mining: Practical Machine Learning Tools and Techniques

Book: Data Mining: Concepts and Techniques