With the development of the Internet, network security has become increasingly important in almost every field. In 2015, IBM and the Ponemon Institute studied the cost of data breaches at 62 companies and found an average cost of $6.5 million per breach [1]. Recent security incidents include the WannaCry ransomware attack and threats associated with the dark web. WannaCry affected more than 300,000 computers in 150 countries, with estimated losses of about $8 billion. Many important organizations, such as hospitals, universities and banks, were seriously affected. A key step in such attacks is to infect machines or hosts inside the target network through spam email. To prevent malicious intrusion into organizations and companies, spam email must therefore be detected accurately. Conventional machine learning models, such as the naïve Bayes classifier and the support vector machine, have been applied to this task. However, current email filters do not achieve sufficient accuracy. This matters because a single successful intrusion may compromise the security of an entire company, or even bankrupt it.
In this project, the email content is first preprocessed and represented with word embeddings. Different deep learning architectures are then applied to detect spam email. In the experiments, the performance of several models is compared: the naïve Bayes classifier, CNN (convolutional neural network), RNN (recurrent neural network), LSTM (long short-term memory network) and CNN-LSTM. Results on the CSDMC2010 SPAM corpus [2] indicate that a stacked LSTM achieves 99.20% detection accuracy, considerably better than the other algorithms evaluated on this problem.
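As an illustration of the preprocessing step that precedes the embedding layer, the following is a minimal sketch of how raw email text might be turned into fixed-length integer sequences. The exact preprocessing used in this project is not specified here, so the function names (`tokenize`, `build_vocab`, `encode`), the reserved ids, the tag-stripping rule and the sequence length are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Reserved ids for padding and out-of-vocabulary tokens
# (an assumed convention, not taken from the paper).
PAD, UNK = 0, 1

def tokenize(text):
    """Lowercase, drop HTML-like tags, and split into word tokens."""
    text = re.sub(r"<[^>]+>", " ", text.lower())
    return re.findall(r"[a-z0-9']+", text)

def build_vocab(texts, max_size=10000):
    """Map the most frequent tokens to integer ids starting at 2."""
    counts = Counter(tok for t in texts for tok in tokenize(t))
    return {tok: i + 2 for i, (tok, _) in enumerate(counts.most_common(max_size))}

def encode(text, vocab, max_len=8):
    """Convert text to a fixed-length id list, truncating or right-padding."""
    ids = [vocab.get(tok, UNK) for tok in tokenize(text)][:max_len]
    return ids + [PAD] * (max_len - len(ids))

# Toy examples (hypothetical, not drawn from the CSDMC2010 corpus).
emails = ["<p>WIN a FREE prize now</p>", "Meeting moved to 3pm, see agenda"]
vocab = build_vocab(emails)
encoded = [encode(e, vocab) for e in emails]
```

Each id sequence produced this way could then index into an embedding matrix, whose output vectors would be passed to a model such as an LSTM. That part is not shown, since it depends on the deep learning library and hyperparameters used.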