Implement ML based DGA detection. #1086

Closed
wants to merge 1 commit into from


aouinizied
Collaborator

This PR introduces ML-based DGA detection.

This first implementation is a proof of concept, and I'm convinced that your feedback will point to several improvements.

  • Detection of DGA is moved to a decision tree that is based on several statistical features.
  • DGA detection sensitivity is configurable through the API call.
  • As the tests show, despite some false positives in the pcap tests, results improve on larger test cases (accuracy: 0.72 -> 0.84, precision: 0.89 -> 0.94, recall: 0.50 -> 0.72).
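
To make the first two points concrete, here is a minimal C sketch of a decision tree over statistical features with a configurable threshold. The feature set, the thresholds, and every name (`dga_features`, `dga_extract_features`, `dga_tree_score`, `dga_is_suspicious`) are assumptions made for illustration; this is not the trained tree or the API from this PR.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical feature vector; not the feature set used in the PR. */
struct dga_features {
  size_t length;            /* characters in the name, dots excluded */
  size_t max_consonant_run; /* longest run of consecutive consonants */
  double digit_ratio;       /* share of characters that are digits */
  double vowel_ratio;       /* share of characters that are vowels */
};

/* Computes the features over the whole input string. */
static void dga_extract_features(const char *name, struct dga_features *f) {
  size_t n = 0, digits = 0, vowels = 0, run = 0, max_run = 0;
  const unsigned char *p;

  for (p = (const unsigned char *)name; *p != '\0'; p++) {
    int c = tolower(*p);

    if (c == '.') { run = 0; continue; }  /* label separator */
    n++;
    if (isdigit(c)) { digits++; run = 0; }
    else if (strchr("aeiou", c) != NULL) { vowels++; run = 0; }
    else if (isalpha(c)) { if (++run > max_run) max_run = run; }
    else run = 0;                          /* '-', '_' and so on */
  }

  f->length = n;
  f->max_consonant_run = max_run;
  f->digit_ratio = n ? (double)digits / (double)n : 0.0;
  f->vowel_ratio = n ? (double)vowels / (double)n : 0.0;
}

/* Toy hand-written tree returning a score in [0, 1].
 * The split values are illustrative, not learned from data. */
static double dga_tree_score(const struct dga_features *f) {
  if (f->length < 8) return 0.05;
  if (f->vowel_ratio < 0.20) return (f->digit_ratio > 0.10) ? 0.95 : 0.80;
  if (f->max_consonant_run >= 5) return 0.70;
  return 0.10;
}

/* Lower thresholds flag more names (assumed meaning of "sensitivity"). */
static int dga_is_suspicious(const char *name, double threshold) {
  struct dga_features f;

  dga_extract_features(name, &f);
  return dga_tree_score(&f) >= threshold;
}
```

A caller would pick the threshold per deployment, for example `dga_is_suspicious(host, 0.5)`; lowering it trades more false positives for higher recall.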

Limitations that must be taken into account:

  • Currently, DGA detection is called from several dissectors. The model is trained on domain names only, so handling hostnames and other inputs requires extending the dataset I'm using.
  • The feature extraction approach is fairly naive: features are computed over the whole input. Better detection can be achieved by computing these statistics only on the TLD, SLD, etc. This can be hard for domains like demo.google.co.uk. Moreover, the first point must be solved beforehand, so that the model is called only on domains and not on anything else.
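
The difficulty with names like demo.google.co.uk can be sketched as follows: the second-level label is only found correctly if multi-label suffixes such as co.uk are known. The suffix table below is a tiny hard-coded assumption (a real implementation would need a complete public suffix list), and `extract_sld` and `ends_with_suffix` are hypothetical helpers, not nDPI functions.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative subset only; assumed for this sketch. */
static const char *const multi_label_suffixes[] = {
  "co.uk", "org.uk", "com.au", "co.jp"
};

/* Returns 1 if host ends with ".<suffix>" and has a label before it. */
static int ends_with_suffix(const char *host, size_t host_len,
                            const char *suffix) {
  size_t s_len = strlen(suffix);

  if (host_len < s_len + 2) return 0;
  if (strcmp(host + host_len - s_len, suffix) != 0) return 0;
  return host[host_len - s_len - 1] == '.';
}

/* Copies the label left of the suffix into out.
 * Returns 0 on success, -1 if no such label exists or out is too small. */
static int extract_sld(const char *host, char *out, size_t out_len) {
  size_t host_len = strlen(host);
  size_t suffix_len = 0;
  size_t i, start, end, len;

  for (i = 0; i < sizeof(multi_label_suffixes) /
                  sizeof(multi_label_suffixes[0]); i++) {
    if (ends_with_suffix(host, host_len, multi_label_suffixes[i])) {
      suffix_len = strlen(multi_label_suffixes[i]);
      break;
    }
  }

  if (suffix_len == 0) {
    /* Fall back to treating the last label as the suffix. */
    const char *last_dot = strrchr(host, '.');

    if (last_dot == NULL || last_dot == host) return -1;
    suffix_len = strlen(last_dot + 1);
    if (suffix_len == 0) return -1;  /* trailing dot */
  }

  end = host_len - suffix_len - 1;   /* index of the dot before the suffix */
  start = end;
  while (start > 0 && host[start - 1] != '.') start--;

  len = end - start;
  if (len == 0 || len >= out_len) return -1;
  memcpy(out, host + start, len);
  out[len] = '\0';
  return 0;
}
```

With this table, demo.google.co.uk yields `google`; without the co.uk entry the fallback would yield `co`, which is exactly the kind of error that would distort per-label statistics.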

Regards,
Zied

@daniele-sartiano

Hi @aouinizied
Have you considered a deep learning approach to solve this task?
A model based on character level embeddings and LSTM works pretty well. In this way, no feature extraction is needed, the input of the classifier is only the domain name.

Regards,
Daniele

@aouinizied
Collaborator Author

aouinizied commented Dec 12, 2020

@daniele-sartiano Yes, deep learning approaches achieve very high performance without feature extraction.
There are two reasons why I didn't use it for this PR:

  • Integration and additional dependencies for the nDPI project (tensorflow, tflite, etc.), which are not trivial; we need to make sure the ntop community is OK with such a move.
  • Inference performance when nDPI is integrated into a probe running at several Mpps. We can minimize overhead through post-training quantization and a simpler network architecture; however, I considered that too advanced for a first PoC.

So, I mainly focused on introducing an ML workflow to nDPI:

  • I started by adding a workflow that tracks DGA classification performance within the nDPI CI (ML or classic), to ensure that each pushed modification (for example, a model upgrade) does not degrade the baseline performance.
  • Then, as a first version of this approach, I simply trained a C4.5 decision tree, translated it to C, and optimized the feature extraction process.
  • Despite being a simplistic C4.5 tree, it performs better than the classic nDPI implementation. Once we are sure there is no overhead and it is merged into master, we can move to more complex models, as I believe there is room for improvement.
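
The kind of check such a CI workflow performs can be sketched in C as below: accumulate a confusion matrix over a labeled test set, derive accuracy, precision, and recall, and compare them to a stored baseline. All names (`dga_confusion`, `dga_metrics`, `dga_metrics_regressed`) and the tolerance parameter are assumptions for illustration, not the workflow added in this PR.

```c
/* Counts of prediction outcomes on a labeled test set. */
struct dga_confusion {
  unsigned long tp, fp, tn, fn;
};

struct dga_metrics {
  double accuracy, precision, recall;
};

/* Records one sample: predicted and actual are 1 for DGA, 0 otherwise. */
static void dga_confusion_add(struct dga_confusion *c,
                              int predicted, int actual) {
  if (predicted && actual) c->tp++;
  else if (predicted && !actual) c->fp++;
  else if (!predicted && actual) c->fn++;
  else c->tn++;
}

/* Division that yields 0 instead of dividing by zero. */
static double safe_div(unsigned long num, unsigned long den) {
  return den ? (double)num / (double)den : 0.0;
}

static struct dga_metrics dga_metrics_compute(const struct dga_confusion *c) {
  struct dga_metrics m;

  m.accuracy = safe_div(c->tp + c->tn, c->tp + c->tn + c->fp + c->fn);
  m.precision = safe_div(c->tp, c->tp + c->fp);
  m.recall = safe_div(c->tp, c->tp + c->fn);
  return m;
}

/* Returns 1 if any metric dropped more than tolerance below baseline. */
static int dga_metrics_regressed(const struct dga_metrics *current,
                                 const struct dga_metrics *baseline,
                                 double tolerance) {
  return current->accuracy + tolerance < baseline->accuracy ||
         current->precision + tolerance < baseline->precision ||
         current->recall + tolerance < baseline->recall;
}
```

A CI job built on this would exit with a non-zero status when `dga_metrics_regressed` returns 1, so a model upgrade that lowers any of the three metrics fails the build.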

If you have any references on deep learning DGA detection with field evaluation in terms of CPU, memory, and maximum domains/sec, I'm really interested.

Regards,
Zied

@aouinizied aouinizied closed this Dec 12, 2020
@aouinizied aouinizied reopened this Dec 12, 2020
@daniele-sartiano

If you have some references to Deep Learning DGA detection with evaluation on the field in terms of CPU, memory, maximum domain/sec I'm really interested in it.

I don't have this type of information, but it could be interesting to create a fast version of a DGA classifier with deep learning. I could create a C++ client (inference only). I will do some experiments.

Regards,
Daniele


@lucaderi
Member

@aouinizied I have further analyzed your work and I have some questions:

  • How is ngi_dga_tree.c.inc generated? Where is the code used for training?
  • What are the licenses of the training data? Do they allow the data to be used to generate derivatives?
  • The feature extraction code should, if possible, be contained in its own
    function or, even better, its own C file, so that training and inference
    share the same feature extraction process. At this stage it is not clear
    how to retrain (and evaluate) on a different dataset.
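
The third point could look roughly like the C sketch below: one feature function is the single source of truth, and both the training export and the inference path call it. The three features, the CSV layout, and the names `dga_features`, `dga_training_row`, and `dga_predict` are assumptions for illustration; the rule in `dga_predict` is a placeholder standing in for a trained model.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define DGA_NUM_FEATURES 3

/* Shared feature extraction, used by both training and inference.
 * out[0] = length without dots, out[1] = digits, out[2] = vowels. */
static void dga_features(const char *name,
                         unsigned int out[DGA_NUM_FEATURES]) {
  const unsigned char *p;

  out[0] = out[1] = out[2] = 0;
  for (p = (const unsigned char *)name; *p != '\0'; p++) {
    int c = tolower(*p);

    if (c == '.') continue;
    out[0]++;
    if (isdigit(c)) out[1]++;
    if (strchr("aeiou", c) != NULL) out[2]++;
  }
}

/* Training side: formats one CSV row "length,digits,vowels,label".
 * Returns 0 on success, -1 if buf is too small. */
static int dga_training_row(const char *name, int label,
                            char *buf, size_t buf_len) {
  unsigned int f[DGA_NUM_FEATURES];
  int n;

  dga_features(name, f);
  n = snprintf(buf, buf_len, "%u,%u,%u,%d", f[0], f[1], f[2], label);
  return (n > 0 && (size_t)n < buf_len) ? 0 : -1;
}

/* Inference side: placeholder rule (long name, fewer than 20% vowels). */
static int dga_predict(const char *name) {
  unsigned int f[DGA_NUM_FEATURES];

  dga_features(name, f);
  return f[0] >= 12 && f[2] * 5 < f[0];
}
```

Because the exported rows and the runtime decision read the same function, retraining on a different dataset only requires regenerating the rows and replacing the body of `dga_predict`.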

@lucaderi
Member

lucaderi commented Jan 4, 2021

@aouinizied Please submit a new PR once you have had time to address the comments above.

@lucaderi lucaderi closed this Jan 4, 2021