The goal of applying ML to intrusion detection is to automatically and quickly identify new attacks. Three main paradigms:
Anomaly detection: model normal network and system behavior and flag deviations from the norm. Can detect zero-day attacks, but is prone to false positives.
Misuse detection: detect known attacks using signatures. Produces few false positives on known attacks, but misses novel attacks.
Hybrid detection: combine misuse and anomaly detection.
Task: given training examples, learn a function y = f(x).
Training: minimize prediction error on the labeled set {(x₁,y₁), …, (xₙ,yₙ)}. Testing: apply f to unseen examples.
Data drawn from the real world → split randomly → training set vs. test set. Features are extracted from the raw data (e.g., image pixels, histograms, GIST). Generalization: the model must work on new test data, not just memorize the training set.
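A minimal sketch of the train/test workflow, using only the standard library. The toy data, the one-feature threshold model, and the helper names (`train_test_split`, `fit_threshold`, `error_rate`) are assumptions made for illustration and do not come from the notes.

```python
import random

def train_test_split(examples, test_fraction=0.3, seed=0):
    """Shuffle a copy of the examples and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def error_rate(t, examples):
    """Fraction of examples misclassified by f(x) = 1 if x >= t else 0."""
    return sum((x >= t) != bool(y) for x, y in examples) / len(examples)

def fit_threshold(train):
    """Training: pick the threshold t that minimizes error on the labeled set."""
    candidates = sorted(x for x, _ in train)
    return min(candidates, key=lambda t: error_rate(t, train))

# Toy data (made up): one numeric feature x, label 1 when x > 50.
data = [(x, int(x > 50)) for x in range(0, 100, 2)]
train, test = train_test_split(data)

t = fit_threshold(train)
print("learned threshold:", t)
print("training error:", error_rate(t, train))
print("test error:", error_rate(t, test))  # measured on examples never seen in training
```

The test error is the number that matters for generalization. The training error here is zero by construction, which says little about how the model behaves on new data.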
| Type | Description |
|---|---|
| Supervised (S) | Task is to find a function or model that explains the (labeled) data. |
| Unsupervised (U) | Main task is to find patterns, structures, or knowledge in unlabeled data. |
| Semi-supervised (SS) | Some of the data is labeled during acquisition. |
Given records with attributes and class labels, build a model to classify future data. Decision trees repeatedly partition until each partition belongs to one class (or is small enough). Can be expressed as rules in Disjunctive Normal Form.
Features: Outlook, Temperature, Humidity, Wind. Class: PlayTennis (Yes/No). Rule form: (Outlook=Sunny ∧ Humidity=Normal) ∨ (Outlook=Overcast) ∨ (Outlook=Rain ∧ Wind=Weak) → Yes.
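The DNF rule can be written directly as a boolean expression: each conjunction becomes an `and` clause and the clauses are joined with `or`. The function name `play_tennis` is an illustrative choice.

```python
def play_tennis(outlook, humidity, wind):
    """Return True when the DNF rule predicts PlayTennis = Yes."""
    return (
        (outlook == "Sunny" and humidity == "Normal")
        or outlook == "Overcast"
        or (outlook == "Rain" and wind == "Weak")
    )

print(play_tennis("Sunny", "Normal", "Strong"))  # first clause matches
print(play_tennis("Rain", "High", "Strong"))     # no clause matches
```

Temperature does not appear in the rule, so it is not a parameter: the tree never needed to split on it.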
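Entropy E(S) = minimum bits needed to encode the class of an example in S; roughly, a measure of impurity. Max when the classes are evenly distributed; min (zero) when all examples are one class.
Information gain G(S,A) is the expected reduction in entropy from partitioning S on attribute A. Higher G(S,A) means attribute A better separates the samples into purer subsets. Select the attribute with the highest gain.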
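A short sketch of both quantities, assuming the standard ID3 definitions E(S) = −Σᵢ pᵢ log₂ pᵢ and G(S,A) = E(S) − Σᵥ (|Sᵥ|/|S|) E(Sᵥ). The 14-row table is the classic PlayTennis toy data set; it is not given in these notes and is reproduced here as an assumption.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """E(S) = -sum_i p_i * log2(p_i) over the class proportions p_i."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, target="PlayTennis"):
    """G(S, A) = E(S) minus the size-weighted entropy of each partition S_v."""
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy([r[target] for r in rows]) - remainder

COLUMNS = ("Outlook", "Temperature", "Humidity", "Wind", "PlayTennis")
RAW = [  # assumed toy data, not part of the notes
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
rows = [dict(zip(COLUMNS, r)) for r in RAW]

gains = {a: information_gain(rows, a) for a in COLUMNS[:-1]}
best = max(gains, key=gains.get)  # attribute chosen for the root split
print(best, {a: round(g, 3) for a, g in gains.items()})
```

On this data the highest-gain attribute is Outlook, which is consistent with Outlook appearing in every clause of the PlayTennis rule.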
Can supplement honeypot analysis and penetration testing; can highlight malicious traffic; can characterize known scanning activity; can detect previously unknown network anomalies.
Construct a partition of the training examples that optimizes a distance/similarity function (e.g., Euclidean or Mahalanobis distance). Examples in a cluster are more similar to each other than to examples in other clusters.
Test data assigned to cluster with shortest distance to centroid.
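The assignment step can be sketched as follows, assuming Euclidean distance and clusters that have already been formed. The cluster names and the two-dimensional points are made up for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+

def centroid(points):
    """Coordinate-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def nearest_cluster(x, centroids):
    """Name of the cluster whose centroid is closest to x."""
    return min(centroids, key=lambda name: dist(x, centroids[name]))

# Hypothetical clusters produced by an earlier clustering pass.
clusters = {
    "normal": [(1, 1), (1, 2), (2, 1), (2, 2)],
    "scan": [(8, 8), (8, 9), (9, 8), (9, 9)],
}
centroids = {name: centroid(pts) for name, pts in clusters.items()}

print(nearest_cluster((2, 3), centroids))
print(nearest_cluster((7, 9), centroids))
```

Swapping in Mahalanobis distance would mean replacing `dist` with a function that also takes each cluster's covariance into account.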
Supervised C&C detection can achieve a high percentage of correctly classified traffic, but it needs labeled data: known examples of C&C communication.
Intrusion detection is a classification problem: partition mixed (normal + attack) traffic into pure subsets using features with high information gain.
Raw tcpdump packet data → summarized into connection records. Each record: time, duration, src, dst, bytes, service, flag (e.g., SF = SYN and FIN both seen, i.e. normal open and close; S0 = SYN seen but no reply; REJ = rejected), etc.
Raw attributes (e.g., flag=S0) alone have low information gain. Constructed features like “percentage of S0 connections to same host” have high information gain and distinguish intrusions (e.g., SYN flood).
Use temporal and statistical patterns, e.g., “many S0 connections to same service/host within a short time window.” Mine patterns, then construct features from them.
Correlations among features, e.g. (service=http, flag=S0). Basic algorithm: association rules.
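Association rules are usually scored by support (how often an itemset occurs) and confidence (how often the right-hand side holds when the left-hand side does). A minimal sketch of both measures on made-up connection records; it only scores one candidate rule and does not implement a full mining algorithm such as Apriori.

```python
def support(records, itemset):
    """Fraction of records that contain every attribute=value pair in itemset."""
    hits = sum(all(r.get(k) == v for k, v in itemset.items()) for r in records)
    return hits / len(records)

def confidence(records, lhs, rhs):
    """Estimate of P(rhs | lhs): support(lhs and rhs) / support(lhs)."""
    return support(records, {**lhs, **rhs}) / support(records, lhs)

# Hypothetical connection records (only the attributes used here).
records = [
    {"service": "http", "flag": "S0"},
    {"service": "http", "flag": "S0"},
    {"service": "http", "flag": "S0"},
    {"service": "http", "flag": "SF"},
    {"service": "smtp", "flag": "SF"},
]

s = support(records, {"service": "http", "flag": "S0"})
c = confidence(records, {"service": "http"}, {"flag": "S0"})
print(f"support={s:.2f} confidence={c:.2f}")
```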
Sequential patterns, e.g. (http,S0) → (http,S0) [0.8, 2s]. Basic algorithm: frequent episodes.
Basic algorithms produce too many useless patterns. Use axis attributes (e.g., service) so patterns must describe essential features.
Use count, percent, and average operators to add temporal/statistical features.
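Pattern: (flag=S0, service=http) with dst_host as reference. Construct features such as: the count of connections to the same dst_host in the past 2 seconds and, among those, the percentage with flag S0 and the percentage with the same service.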
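A minimal sketch of this kind of feature construction, applying count and percent operators over a 2-second window with dst_host as the reference attribute. The record layout, the feature names (`count`, `pct_s0`, `pct_same_service`), and the sample trace are assumptions for illustration.

```python
def window_features(records, window=2.0):
    """For each connection, summarize recent connections to the same dst_host.

    `records` must be sorted by "time". The current connection is included
    in its own window, so the count is always at least 1.
    """
    features = []
    for i, cur in enumerate(records):
        recent = [
            r for r in records[: i + 1]
            if r["dst_host"] == cur["dst_host"]
            and cur["time"] - r["time"] <= window
        ]
        count = len(recent)
        features.append({
            "count": count,
            "pct_s0": sum(r["flag"] == "S0" for r in recent) / count,
            "pct_same_service": sum(
                r["service"] == cur["service"] for r in recent
            ) / count,
        })
    return features

# Hypothetical trace: a burst of half-open connections, then normal traffic.
records = [
    {"time": t, "dst_host": "victim", "service": "http", "flag": "S0"}
    for t in (0.0, 0.25, 0.5, 0.75, 1.0)
] + [
    {"time": 5.0, "dst_host": "web", "service": "http", "flag": "SF"},
    {"time": 10.0, "dst_host": "victim", "service": "smtp", "flag": "SF"},
]

for rec, feat in zip(records, window_features(records)):
    print(rec["time"], rec["dst_host"], feat)
```

During the burst both the count and the S0 percentage climb, while the later normal connections fall back to a count of 1 with no S0 flags. That contrast is what gives the constructed features more information gain than the raw flag alone.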
MIT Lincoln Lab dataset: 38 attack types in four categories. 40% of attack types appear only in test data (new to IDS).
| Category | Description |
|---|---|
| DoS | Denial-of-service, e.g., SYN flood |
| Probing | Gathering info, e.g., port scan |
| r2l | Remote-to-local: illegally gaining local access |
| u2r | User-to-root: illegally gaining superuser |
Basic (intrinsic) features: protocol, duration, flag, number of wrong fragments, number of urgent packets, same IP/port pair, etc.
Content features: number of failed logins, root shell, su attempted, number of compromised states, number of write accesses, number of outbound commands, guest/root login, etc.
Traffic features: number of connections to the same dst_host in 2 s; r_error, diff_srv_rate; same-service connections; % same dst_host; etc.
Mining Audit Data for Automated Models (MADAM ID): uses data mining (association rules, frequent episodes) plus ML to generate IDS rules. Axis and reference attributes constrain the patterns, and features are built with count/percent/average operators. This replaces manual, hand-coded intrusion patterns.