
Machine Learning Model for Passive OS Fingerprinting

OS fingerprinting is the process of detecting a remote server's OS (and version) by communicating with it and analyzing its response. This process matters to security experts (and attackers), since knowing a server's OS reveals which security vulnerabilities it is likely exposed to.

The most common fingerprinting tools (Nmap, NetworkMiner, Satori, p0f) rely on a database of "network signatures" (a signature can be thought of as the 'accent' or 'body language' of an OS). The database is maintained manually by security experts and has not been updated in a long time (most tools rely on p0f's database).

This project is an attempt to create an ML model for OS fingerprinting.

Background on OS Fingerprinting

There are 2 types of fingerprinting:

  • Active fingerprinting takes advantage of known security flaws: if a vulnerability existed in version X of the Linux kernel and was fixed in version Y, then attempting the exploit helps determine the server's kernel version ("exploit completed successfully" --> "server runs version X"). Nmap is a common tool for active fingerprinting.

  • Passive fingerprinting only analyzes packets of 'typical/legitimate' communication (mainly the TCP/IP headers). p0f is a common tool for passive fingerprinting.

The trade-off between the two methods: the active method has better accuracy, but its 'aggressive' nature makes it much easier for firewalls to detect.

In this project my models perform the passive version. To be precise, they only look at the server's TCP SYN-ACK message, which makes the process extremely stealthy and fast.

Related Work: I found an IEEE paper about a similar project:
      A Machine Learning-based Tool for Passive OS Fingerprinting with TCP Flavor as a Novel Feature

Data Generation

I collected data on ~1,000,000 servers (chosen from a list of popular websites).

Establishing Ground Truth

Since I don't have a datacenter's worth of my own servers, finding labeled servers felt like a 'chicken and egg' problem. I decided to use Nmap's analysis as my ground truth: it may not be 100% accurate, but it harnesses the precision of active fingerprinting, and it's an industry standard.

Nmap usually reports 85%-90% certainty for its output, returned as a list of guesses in descending order of certainty. For this reason I aimed for 85%-90% accuracy with my models, and decided that the most relevant accuracy metric would be top-2 accuracy.

Feature Selection

I chose the features by reading p0f's documentation, the paper mentioned above, and the TCP and IP RFCs.
Some of the most helpful fields are IP's "Don't Fragment" flag, IP's TTL value, TCP's MSS value, and TCP's options.
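
For illustration, here is a minimal sketch of pulling these fields out of a captured SYN-ACK with Scapy (field names follow Scapy's IP/TCP layers; the exact feature set and encoding used in the notebooks may differ):

```python
from scapy.all import IP, TCP

def extract_features(pkt):
    """Pull the TCP/IP header fields used as features from a sniffed SYN-ACK."""
    ip, tcp = pkt[IP], pkt[TCP]
    opts = dict(tcp.options)  # e.g. {'MSS': 1460, 'SAckOK': b'', 'WScale': 7, ...}
    return {
        "ip_df":         bool(ip.flags.DF),   # IP "Don't Fragment" flag
        "ip_ttl":        ip.ttl,              # observed TTL (hints at the initial TTL)
        "tcp_window":    tcp.window,          # advertised window size
        "tcp_mss":       opts.get("MSS", 0),  # Maximum Segment Size option
        # Order of TCP options as one string, e.g. "MSS,SAckOK,Timestamp,NOP,WScale"
        "tcp_opt_order": ",".join(str(name) for name, _ in tcp.options),
    }
```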

Data Collection

The process of retrieving labels and the process of retrieving features were run separately using different tools.

Label retrieval: Python has a wrapper for Nmap, so automating the scan was relatively trivial. Another advantage of Nmap is its built-in ability to scan multiple hosts concurrently.
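
For reference, a minimal sketch of automating such a scan with the python-nmap wrapper (the arguments and result parsing here are illustrative, not necessarily what this repository uses; OS detection requires root privileges):

```python
import nmap  # python-nmap, a wrapper around the Nmap binary

def get_os_labels(host):
    """Run Nmap OS detection (-O) against one host and return its guesses."""
    scanner = nmap.PortScanner()
    scanner.scan(hosts=host, arguments="-O")
    results = []
    for scanned in scanner.all_hosts():  # keys are the resolved IP addresses
        # Each match is a dict with a 'name' and an 'accuracy' (0-100), best guess first.
        for match in scanner[scanned].get("osmatch", []):
            results.append((match["name"], int(match["accuracy"])))
    return results
```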

Feature retrieval: to analyze a server's SYN-ACK message, I sent an HTTP request while sniffing the communication with Scapy (a sniffer & packet manipulation tool). I used multithreading to probe multiple hosts simultaneously.
(Initially I only sent a TCP SYN message, as it's simpler & faster than sending a full HTTP request. I noticed there was almost no variety in the response's TCP options, and suspected it may be due to the 'synthetic' nature of the probe. Switching to a full HTTP request resulted in the variety I was hoping for.)
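
A rough sketch of this probe, assuming port 80 and Scapy's AsyncSniffer running alongside an ordinary HTTP request (sniffing requires root; the actual collection script may be structured differently):

```python
import time
import requests
from scapy.all import AsyncSniffer, TCP

def capture_syn_ack(host, port=80, timeout=5):
    """Send a normal HTTP request and capture the server's SYN-ACK with Scapy."""
    # 'host' is assumed to be an IP address (or a hostname resolving to one address).
    sniffer = AsyncSniffer(filter=f"tcp and src host {host} and src port {port}",
                           timeout=timeout)
    sniffer.start()
    time.sleep(0.5)  # give the sniffer a moment to start before probing
    try:
        requests.get(f"http://{host}", timeout=timeout)  # triggers a real TCP handshake
    except requests.RequestException:
        pass  # only the handshake matters, not a successful HTTP response
    for pkt in sniffer.stop():
        if pkt.haslayer(TCP) and pkt[TCP].flags.S and pkt[TCP].flags.A:
            return pkt  # the SYN-ACK of the three-way handshake
    return None
```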

My scan found the following operating systems:

OS              # Samples
Linux 5.X           12392
Linux 4.X          110824
Linux 3.X           88485
Linux 2.6.X         50978
Linux (Other)        5634
OpenBSD 4.X          7041
FreeBSD 6.X         72072
embedded            76809
Windows 2016         6224
Windows 2012         9014


Model Comparison

The Models:

  • SVM: in some of the features, different operating systems produce different value ranges (for example, Windows systems tend to have an initial TTL of 128, while Linux systems tend to have an initial TTL of 64). I believed this property might call for a linear classifier.
  • Gradient Boosting: this is simply a typical choice for tabular data.

  • Neural Network: adding this model was mostly for my own curiosity. The network has 4 fully-connected layers.
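
A rough sketch of how these three models could be instantiated with scikit-learn (hyperparameters and layer sizes below are placeholders, not the settings used in the training notebook):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

models = {
    # Linear classifier, motivated by the range-separated features (e.g. initial TTL).
    "svm": make_pipeline(StandardScaler(), LinearSVC()),
    # A typical choice for tabular data.
    "gradient_boosting": GradientBoostingClassifier(),
    # Small fully-connected network; layer sizes are illustrative only.
    "neural_net": MLPClassifier(hidden_layer_sizes=(64, 64, 64, 64)),
}
```

For top-2 scoring, the SVM's decision_function values (or the other models' predict_proba outputs) can be ranked per sample.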

The Metric:
    As I wrote under Establishing Ground Truth, the metric that best fits my data is top-2 accuracy.
    Note that it does not hinder the user experience too much: receiving two guesses isn't so bad when looking for exploits.
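
For concreteness, scikit-learn can compute this metric directly; a toy example with three classes:

```python
from sklearn.metrics import top_k_accuracy_score

# A prediction counts if the true class is among the two highest-scoring classes.
y_true = [0, 1, 2]
y_score = [[0.5, 0.3, 0.2],    # true class 0 ranked 1st -> counts
           [0.4, 0.35, 0.25],  # true class 1 ranked 2nd -> counts
           [0.6, 0.3, 0.1]]    # true class 2 ranked 3rd -> misses
print(top_k_accuracy_score(y_true, y_score, k=2))  # 0.666...
```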

The Results:
    All 3 models reached a top-2 accuracy of around 85%. Graphs are available in the Model Training Notebook.
