Implement ML based DGA detection. #1086

aouinizied · 2020-12-10T17:02:55Z

This PR introduces ML-based DGA detection.

This first implementation is a proof of concept; I'm convinced there is room for several improvements based on your feedback.

DGA detection moves to a decision tree built on several statistical features.
DGA detection sensitivity is configurable through the API call.
As the tests show, despite some false positives in the pcap tests, results improve on the larger test cases (accuracy: 0.72 -> 0.84, precision: 0.89 -> 0.94, recall: 0.50 -> 0.72).
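
For readers less familiar with the three figures quoted above, here is a minimal sketch of how accuracy, precision, and recall are conventionally computed for a binary DGA classifier. The function name and the toy labels are invented for illustration; they are not the PR's test harness or data.

```python
def classification_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for binary labels, 1 = DGA."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # DGA flagged as DGA
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign left alone
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign flagged as DGA
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # DGA missed
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Toy labels, invented for illustration only (not the PR's results).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
accuracy, precision, recall = classification_metrics(y_true, y_pred)
```

Read this way, a higher recall means fewer missed DGA domains, while a higher precision means fewer benign names flagged.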

Limitations that must be taken into account:

1. DGA detection is currently called from several dissectors, but the model was trained on domains only. Handling hostnames and other inputs requires extending the dataset I'm using.
2. Feature extraction is naive: features are computed over the whole input. Detection could improve if these statistics were computed separately on the TLD, SLD, etc. This can be hard for domains like demo.google.co.uk. Moreover, point 1 must be solved first, so that the model is called only on domains and not on anything else.
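
To illustrate why names like demo.google.co.uk are awkward, here is a rough sketch comparing a naive "last label is the TLD" split with one that knows about multi-label suffixes. The suffix set below is a tiny hypothetical sample; a real implementation would presumably load the full Public Suffix List. Both function names are assumptions, not nDPI code.

```python
# Hypothetical, tiny sample of multi-label public suffixes (assumption);
# a real deployment would load the complete Public Suffix List instead.
MULTI_LABEL_SUFFIXES = {"co.uk", "com.au", "co.jp"}


def naive_split(name):
    """Treat the last label as the TLD and the one before it as the SLD."""
    labels = name.lower().rstrip(".").split(".")
    suffix = labels[-1]
    sld = labels[-2] if len(labels) >= 2 else ""
    subdomain = ".".join(labels[:-2])
    return subdomain, sld, suffix


def split_domain(name):
    """Split into (subdomain, sld, suffix), honoring two-label suffixes."""
    labels = name.lower().rstrip(".").split(".")
    suffix_len = 1
    if len(labels) >= 2 and ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES:
        suffix_len = 2
    suffix = ".".join(labels[-suffix_len:])
    rest = labels[:-suffix_len]
    sld = rest[-1] if rest else ""
    subdomain = ".".join(rest[:-1])
    return subdomain, sld, suffix


naive = naive_split("demo.google.co.uk")   # SLD comes out as "co"
aware = split_domain("demo.google.co.uk")  # SLD comes out as "google"
```

With the naive split, per-SLD statistics would be computed on "co" rather than "google", which is the kind of error the second limitation warns about.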

Regards,
Zied

daniele-sartiano · 2020-12-12T15:02:18Z

Hi @aouinizied
Have you considered a deep learning approach for this task?
A model based on character-level embeddings and an LSTM works pretty well. No feature extraction is needed: the only input to the classifier is the domain name.
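
As a rough illustration of what "the only input is the domain name" means in practice, the sketch below turns a name into a fixed-length sequence of integer ids, the usual input to a character embedding layer. The alphabet, the reserved ids, and the default length are assumptions for this example; the embedding and LSTM layers themselves are not shown.

```python
import string

# Hypothetical vocabulary (assumption): id 0 is padding, id 1 is "unknown".
ALPHABET = string.ascii_lowercase + string.digits + "-._"
PAD, UNKNOWN = 0, 1
CHAR_TO_INDEX = {ch: i + 2 for i, ch in enumerate(ALPHABET)}


def encode_domain(name, max_len=63):
    """Map a domain name to max_len integer ids, truncating or right-padding."""
    ids = [CHAR_TO_INDEX.get(ch, UNKNOWN) for ch in name.lower()[:max_len]]
    return ids + [PAD] * (max_len - len(ids))


encoded = encode_domain("ab", max_len=4)
```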

Regards,
Daniele

aouinizied · 2020-12-12T15:34:47Z

@daniele-sartiano Yes, deep learning approaches achieve very high performance without feature extraction.
There are two reasons why I didn't use it for this PR:

Integration and additional dependencies for the nDPI project (tensorflow, tflite, etc.), which are not trivial; we need to make sure the ntop community is OK with such a move.
Inference performance when nDPI is integrated into a probe running at several Mpps. We can minimize the overhead through post-training quantization and a simpler network architecture, but I considered that too advanced for a first PoC.

So, I mainly focused on introducing an ML workflow to nDPI:

I started by adding a workflow that tracks DGA classification performance (ML or classic) within the nDPI CI, to ensure that each pushed modification (for example, a model upgrade) does not degrade the baseline.
Then, as the first version of this approach, I simply trained a C4.5 decision tree, translated it to C, and optimized the feature extraction process.
Despite being a simplistic C4.5, it performs better than the classic nDPI implementation. Once we are sure there is no overhead and it is merged into master, we can move to more complex models, as I believe there is room for improvement.
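
The "translated it to C" step can be pictured with a small generator like the one below. This is only a sketch under stated assumptions: the nested-tuple tree format, the `dga_features` struct, the `dga_tree_predict` function name, and the toy thresholds are all hypothetical and are not taken from the PR.

```python
def tree_to_c(node, indent=1):
    """Render a tree as nested C if/else statements.

    A leaf is an int class label (0 = benign, 1 = DGA); an internal node is
    a tuple (feature_name, threshold, left, right) where `left` is taken
    when feature <= threshold.
    """
    pad = "  " * indent
    if isinstance(node, int):
        return f"{pad}return {node};\n"
    feature, threshold, left, right = node
    return (
        f"{pad}if (f->{feature} <= {threshold}) {{\n"
        + tree_to_c(left, indent + 1)
        + f"{pad}}} else {{\n"
        + tree_to_c(right, indent + 1)
        + f"{pad}}}\n"
    )


def emit_c_function(tree, name="dga_tree_predict"):
    """Wrap the rendered tree in a C function (hypothetical signature)."""
    header = f"static int {name}(const struct dga_features *f) {{\n"
    return header + tree_to_c(tree) + "}\n"


# Toy tree with made-up features and thresholds, for illustration only.
TOY_TREE = ("entropy", 3.5, 0, ("digit_ratio", 0.2, 0, 1))
c_source = emit_c_function(TOY_TREE)
```

Generating plain if/else code keeps inference free of extra dependencies, which matches the concern about adding libraries to nDPI.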

If you have references on deep learning DGA detection evaluated in the field in terms of CPU, memory, and maximum domains/sec, I'm really interested.

Regards,
Zied

daniele-sartiano · 2020-12-12T17:13:41Z

> If you have some references to Deep Learning DGA detection with evaluation on the field in terms of CPU, memory, maximum domain/sec I'm really interested in it.

I don't have this type of information, but it could be interesting to build a fast DGA classifier with deep learning. I could create a C++ client (inference only). I will run some experiments.

Regards,
Daniele

lucaderi · 2020-12-19T15:58:25Z

@aouinizied I have further analyzed your work and I have some questions:

How is the ngi_dga_tree.c.inc generated? Where is the code used for training?
What are the licenses of the training data? Do they allow the data to be used to generate derivatives?
The feature extraction code should, if possible, live in its own function, or even better its own C file, so that training and inference share the same feature extraction process. At this stage it is not clear how to retrain (and evaluate) on a different dataset.
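
The request for a single feature extraction path shared by training and inference could look roughly like the sketch below. It is written in Python for brevity, whereas in nDPI this would be a C function or file; the feature set (length, entropy, digit ratio, vowel ratio) and all function names are assumptions, not the features used in the PR.

```python
import math
from collections import Counter

VOWELS = set("aeiou")


def extract_features(name):
    """Single source of truth for features (hypothetical feature set)."""
    s = name.lower()
    n = len(s)
    if n == 0:
        return {"length": 0, "entropy": 0.0,
                "digit_ratio": 0.0, "vowel_ratio": 0.0}
    counts = Counter(s)
    # Shannon entropy of the character distribution, in bits.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {
        "length": n,
        "entropy": entropy,
        "digit_ratio": sum(ch.isdigit() for ch in s) / n,
        "vowel_ratio": sum(ch in VOWELS for ch in s) / n,
    }


def build_training_rows(labelled_names):
    """Training side: turn (name, label) pairs into (features, label) rows."""
    return [(extract_features(name), label) for name, label in labelled_names]


def classify(name, predict):
    """Inference side: reuse the very same extractor before predicting."""
    return predict(extract_features(name))


features = extract_features("aabb")
```

Because both `build_training_rows` and `classify` go through `extract_features`, retraining on a different dataset cannot silently drift from what inference computes.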

lucaderi · 2021-01-04T11:47:49Z

@aouinizied Please submit a new PR once you have had time to address the comments above.

Implement ML based DGA detection.

bfdb38c

aouinizied closed this Dec 12, 2020

aouinizied reopened this Dec 12, 2020

lucaderi closed this Jan 4, 2021
