Machine Learning for Security | CS 6262 Network Security

01 //

Data Analytics for Security

The goal of applying ML to intrusion detection is to automatically and quickly identify new attacks. Three main paradigms:

Anomaly Detection (A)

Deviation from Normal

Model normal network and system behavior; identify deviations from the norm. Can detect zero-day attacks; prone to false positives.

Misuse Detection (M)

Signature-Based

Detect known attacks using signatures. Can detect known attacks without many false positives; misses novel attacks.

Hybrid (H)

Both Approaches

Combination of misuse and anomaly detection.
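A minimal sketch of how the three paradigms relate, assuming a made-up signature set and a single numeric feature (connections per second). The names SIGNATURES, fit_normal, and hybrid_alert are hypothetical, and the 3-standard-deviation threshold is an illustrative choice, not something the lecture prescribes.

```python
from statistics import mean, stdev

# Hypothetical signatures: substrings that mark known attacks (illustrative only).
SIGNATURES = {"/etc/passwd", "cmd.exe"}

def fit_normal(values):
    """Anomaly detection, step 1: model 'normal' as mean and standard deviation."""
    return mean(values), stdev(values)

def is_anomalous(value, model, k=3.0):
    """Anomaly detection, step 2: flag values more than k deviations from the mean."""
    mu, sigma = model
    return abs(value - mu) > k * sigma

def matches_signature(payload):
    """Misuse detection: match the payload against known attack patterns."""
    return any(sig in payload for sig in SIGNATURES)

def hybrid_alert(payload, conn_rate, model):
    """Hybrid: raise an alert if either detector fires."""
    return matches_signature(payload) or is_anomalous(conn_rate, model)

# Made-up benign connections-per-second measurements.
normal_rates = [10, 12, 11, 9, 10, 13, 11, 10]
model = fit_normal(normal_rates)

print(hybrid_alert("GET /index.html", 11, model))  # benign payload, typical rate
print(hybrid_alert("GET /index.html", 50, model))  # benign payload, unusual rate
print(hybrid_alert("GET /etc/passwd", 11, model))  # known-bad payload
```

The anomaly branch can fire on traffic no signature describes, which is where both the zero-day coverage and the false positives come from.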

02 //

Machine Learning Review

Task: given training examples, learn a function y = f(x). Training: minimize prediction error on labeled set {(x₁,y₁), …, (xₙ,yₙ)}. Testing: apply f to unseen examples.

Process

Data drawn from real world → split randomly → training set vs test set. Features extracted from raw data (e.g., image pixels, histograms, GIST). Generalization - model must work on new test data, not just memorize training.
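The random split can be sketched as below. The function name train_test_split, the 30% test fraction, and the toy data are assumptions made for illustration.

```python
import random

def train_test_split(examples, test_fraction=0.3, seed=0):
    """Shuffle a copy of the examples, then cut it into (train, test)."""
    rng = random.Random(seed)      # fixed seed so the split is repeatable
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy labeled examples (x, y); the labels here are arbitrary.
data = [(x, x % 2) for x in range(10)]
train, test = train_test_split(data)
print(len(train), len(test))
```

The model is fit on train only; test stays unseen until evaluation, which is what makes the measured error an estimate of generalization.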

ML Types

Type	Description
Supervised (S)	Task is to find a function or model that explains the (labeled) data.
Unsupervised (U)	Main task is to find patterns, structures, or knowledge in unlabeled data.
Semi-supervised (SS)	Only some of the data is labeled during acquisition; the model learns from both labeled and unlabeled examples.

Performance Metrics

Measures

Error Rate - fraction of incorrect predictions (1 − accuracy)
Accuracy - fraction of correct predictions
Precision - fraction of predicted positives that are actually positive
Recall - fraction of actual positives that are predicted positive (generalizable to multi-class)
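A small sketch of the four measures for the binary case, computed from true/false positive and negative counts. The function name metrics and the example label vectors are made up for illustration.

```python
def metrics(y_true, y_pred, positive=1):
    """Compute accuracy, error rate, precision, and recall for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    accuracy = (tp + tn) / len(pairs)
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        # Guard against division by zero when nothing is predicted/actually positive.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

y_true = [1, 1, 1, 0, 0, 0, 0, 0]   # 1 = attack, 0 = normal (made-up labels)
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
m = metrics(y_true, y_pred)
print(m)
```

In intrusion detection the positive class (attacks) is usually rare, so accuracy alone can look good while precision or recall is poor.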

03 //

Classification & Decision Trees

Given records with attributes and class labels, build a model to classify future data. Decision trees repeatedly partition until each partition belongs to one class (or is small enough). Can be expressed as rules in Disjunctive Normal Form.

Example: Play Tennis

Features: Outlook, Temperature, Humidity, Wind. Class: PlayTennis (Yes/No). Rule form: (Outlook=Sunny ∧ Humidity=Normal) ∨ (Outlook=Overcast) ∨ (Outlook=Rain ∧ Wind=Weak) → Yes.
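The rule above translates directly into a boolean function: each parenthesized conjunction is one root-to-leaf path ending in Yes, and the paths are OR-ed together. The function name play_tennis is an assumption; Temperature does not appear because the rule never tests it.

```python
def play_tennis(outlook, humidity, wind):
    """Return True (Yes) when any conjunction of the DNF rule holds."""
    return (
        (outlook == "Sunny" and humidity == "Normal")
        or outlook == "Overcast"
        or (outlook == "Rain" and wind == "Weak")
    )

print(play_tennis("Sunny", "Normal", "Strong"))
print(play_tennis("Rain", "High", "Strong"))
```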

Entropy & Information Gain

Entropy E(S) = minimum number of bits needed to encode the class of an example drawn from S; roughly, a measure of impurity. Maximal when the classes are evenly distributed; zero when all examples belong to one class.

G(S,A) = E(S) − Σᵥ (|Sᵥ|/|S|) × E(Sᵥ), summed over each value v of attribute A, where Sᵥ is the subset of S with A = v

Higher G(S,A) means attribute A better separates samples into purer subsets. Select attribute with highest gain.
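A minimal sketch of these two quantities, assuming the standard definition E(S) = −Σ pᵢ log₂ pᵢ over the class proportions pᵢ. The function names and the four-row toy dataset are made up for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """E(S): entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """G(S,A): entropy of S minus the size-weighted entropy of each subset S_v."""
    n = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    remainder = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

def best_attribute(rows, labels, attributes):
    """Pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))

# Made-up toy data: Outlook separates the classes perfectly, Wind not at all.
rows = [
    {"Outlook": "Overcast", "Wind": "Weak"},
    {"Outlook": "Overcast", "Wind": "Strong"},
    {"Outlook": "Sunny", "Wind": "Weak"},
    {"Outlook": "Sunny", "Wind": "Strong"},
]
labels = ["Yes", "Yes", "No", "No"]

print(information_gain(rows, labels, "Outlook"))
print(information_gain(rows, labels, "Wind"))
print(best_attribute(rows, labels, ["Outlook", "Wind"]))
```

A tree builder would call best_attribute at each node, split on it, and recurse on each subset until the subsets are pure or small enough.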

Decision Tree for IDS

Can supplement honeypot analysis and penetration testing; can highlight malicious traffic; can characterize known scanning activity; can detect previously unknown network anomalies.

04 //

Clustering

Construct a partition of training examples that optimizes a distance/similarity function (e.g., Euclidean or Mahalanobis distance). Examples in a cluster are more similar to each other than to examples in other clusters.

Process

1. Predetermine number of clusters; start with seed clusters (one element each)
2. Assign samples to clusters by distance
3. Find new centroids (center of each cluster)
4. Iterate until clusters converge (no more membership changes)

Test data assigned to cluster with shortest distance to centroid.
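The procedure above matches the usual k-means loop. Below is a minimal pure-Python sketch using Euclidean distance; the function names, the iteration cap, and the toy points are assumptions for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+

def assign(points, centroids):
    """Step 2: index of the nearest centroid for each point."""
    return [
        min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
        for p in points
    ]

def update(points, labels, centroids):
    """Step 3: move each centroid to the mean of its members."""
    new = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if members:
            new.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new.append(old)  # keep the old centroid if a cluster is empty
    return new

def kmeans(points, seeds, max_iter=100):
    """Steps 1-4: start from seed points, iterate until memberships stop changing."""
    centroids = list(seeds)
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:  # converged: no membership changes
            break
        labels = new_labels
    return centroids, labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, seeds=[(0, 0), (10, 10)])
print(labels)

# Test data goes to the cluster with the nearest centroid.
print(assign([(9, 9)], centroids))
```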

C&C Protocol Detection

Supervised C&C detection can reach a high percentage of correctly classified traffic, but it needs labeled data: known examples of C&C communication.

Classifiers

No free lunch - ML algorithms are tools, not dogmas
Better smart features + simple classifiers than simple features + smart classifiers
Try simple classifiers first; use more powerful ones with more data (bias-variance tradeoff)

05 //

IDS Preprocessing & Feature Construction

Intrusion detection is a classification problem: partition mixed (normal + attack) traffic into pure subsets using features with high information gain.

Audit Data Preprocessing

Raw tcpdump packet data → summarized into connection records. Each record: time, duration, src, dst, bytes, service, flag (e.g., SF = normal establishment and termination, with both SYN and FIN seen; REJ = connection rejected), etc.

Feature Construction Problem

Raw attributes (e.g., flag=S0, a connection attempt that received no reply) alone have low information gain. Constructed features like “percentage of S0 connections to same host” have high information gain and distinguish intrusions (e.g., SYN flood).

Approach

Use temporal and statistical patterns - e.g., “many S0 connections to same service/host within a short time window.” Mine patterns, then construct features from them.

Pattern Mining

Association Rules

Correlations among features, e.g. (service=http, flag=S0). Basic algorithm: association rules.
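Association rules are typically scored by support (how often the items occur together) and confidence (how often the right-hand side holds when the left-hand side does). A minimal sketch over connection records follows; the function names and the four records are made up for illustration.

```python
def support(records, itemset):
    """Fraction of records containing every attribute=value pair in itemset."""
    hits = sum(
        1 for r in records
        if all(r.get(attr) == value for attr, value in itemset.items())
    )
    return hits / len(records)

def confidence(records, lhs, rhs):
    """support(lhs and rhs) / support(lhs): how often rhs holds given lhs."""
    s_lhs = support(records, lhs)
    return support(records, {**lhs, **rhs}) / s_lhs if s_lhs else 0.0

# Made-up connection records.
records = [
    {"service": "http", "flag": "S0"},
    {"service": "http", "flag": "S0"},
    {"service": "http", "flag": "SF"},
    {"service": "ftp", "flag": "SF"},
]

print(support(records, {"service": "http", "flag": "S0"}))
print(confidence(records, {"service": "http"}, {"flag": "S0"}))
```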

Frequent Episodes (Sequential)

Sequential patterns, e.g. (http,S0) → (http,S0) [0.8, 2s]. Basic algorithm: frequent episodes.

Basic algorithms produce too many useless patterns. Use axis attributes (e.g., service) so patterns must describe essential features.

Axis & Reference Attributes

Axis attribute - most important (e.g., service). Patterns must contain axis attribute values.
Reference attribute - “reference subject” of a sequence (e.g., same destination host). Sequential patterns refer to the same reference value.

Use count, percent, and average operators to add temporal/statistical features.

SYN Flood Feature Example

Pattern: (flag=S0, service=http) with dst_host as reference. Construct:

Count connections to same dst_host in past 2 seconds
Among them: % same service, % S0
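A sketch of how these constructed features might be computed per connection. It assumes records are sorted by time and that the current connection counts toward its own window; the field names, the function name traffic_features, and the sample records are hypothetical.

```python
def traffic_features(records, i, window=2.0):
    """Features for records[i], from connections to the same dst_host
    within the past `window` seconds (current connection included)."""
    cur = records[i]
    recent = [
        r for r in records[: i + 1]
        if r["dst_host"] == cur["dst_host"] and cur["time"] - r["time"] <= window
    ]
    count = len(recent)  # count operator; always >= 1
    same_srv = sum(r["service"] == cur["service"] for r in recent)
    s0 = sum(r["flag"] == "S0" for r in recent)
    return {
        "count": count,
        "same_srv_pct": same_srv / count,  # percent operator
        "s0_pct": s0 / count,
    }

# Made-up connection records, sorted by time (seconds).
records = [
    {"time": 0.0, "dst_host": "victim", "service": "http", "flag": "SF"},
    {"time": 5.0, "dst_host": "victim", "service": "http", "flag": "S0"},
    {"time": 5.2, "dst_host": "victim", "service": "http", "flag": "S0"},
    {"time": 5.4, "dst_host": "other", "service": "ftp", "flag": "SF"},
    {"time": 5.6, "dst_host": "victim", "service": "http", "flag": "S0"},
    {"time": 6.0, "dst_host": "victim", "service": "smtp", "flag": "SF"},
]

print(traffic_features(records, 4))
print(traffic_features(records, 5))
```

A burst of half-open http connections to one host pushes both percentages toward 1, which is the signal a classifier can split on.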

06 //

DARPA 1998 Evaluation

MIT Lincoln Lab dataset: 38 attack types in four categories. 40% of attack types appear only in test data (new to IDS).

Category	Description
DoS	Denial-of-service, e.g., SYN flood
Probing	Gathering info, e.g., port scan
r2l	Remote-to-local: illegally gaining local access from a remote machine
u2r	User-to-root: illegally gaining superuser (root) privileges

Feature Types

Intrinsic

Protocol, duration, flag, # wrong fragments, # urgent packets, same IP/port pair, etc.

Content

# failed logins, root shell, su attempted, # compromised states, # write access, # outbound commands, guest/root login, etc.

Traffic (Mined)

# connections to same dst_host in 2s; r_error, diff_srv_rate; same-service connections; % same dst_host; etc.

Example Rules (from ML)

buffer_overflow: hot ≥ 3, compromised ≥ 1, su_attempted ≤ 0, root_shell ≥ 1
back: compromised ≥ 1, protocol = http
smurf: protocol = ecr_i, count ≥ 5, srv_count ≥ 5
satan: r_error ≥ 3, diff_srv_rate ≥ 0.8
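The four rules above can be written as predicates over a connection record. This is a sketch only: the dictionary layout, the default of 0 for missing numeric fields, the first-match ordering, and returning None when no rule fires are all assumptions, not part of the learned rule set.

```python
# Each rule maps an attack name to a predicate over a record (a dict of features).
RULES = {
    "buffer_overflow": lambda r: (
        r.get("hot", 0) >= 3
        and r.get("compromised", 0) >= 1
        and r.get("su_attempted", 0) <= 0
        and r.get("root_shell", 0) >= 1
    ),
    "back": lambda r: r.get("compromised", 0) >= 1 and r.get("protocol") == "http",
    "smurf": lambda r: (
        r.get("protocol") == "ecr_i"
        and r.get("count", 0) >= 5
        and r.get("srv_count", 0) >= 5
    ),
    "satan": lambda r: r.get("r_error", 0) >= 3 and r.get("diff_srv_rate", 0) >= 0.8,
}

def classify(record):
    """Return the name of the first matching rule, or None if no rule matches."""
    for name, rule in RULES.items():
        if rule(record):
            return name
    return None

print(classify({"protocol": "ecr_i", "count": 10, "srv_count": 10}))
print(classify({"protocol": "http", "compromised": 0}))
```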

MADAM ID Framework

Mining Audit Data for Automated Models: uses data mining (association rules, frequent episodes) + ML to generate IDS rules. Axis/reference attributes constrain patterns. Features via count/percent/average. Replaces manual, hand-coded intrusion patterns.

07 //

Summary

ML for Security - Takeaways

Anomaly / misuse / hybrid - anomaly detects zero-day; misuse low FP on known attacks
Generalization - model must work on new test data
Decision trees - entropy, information gain; partition into pure classes
Clustering - seeds, centroids, distance; converge by iterating
Smart features - high information gain; axis/reference attributes for pattern mining
DARPA 1998 - 38 attacks, 4 categories; 40% new in test; MADAM ID framework