An Interview with Deep Instinct’s CTO, Dr. Eli David
Deep Instinct’s CTO, Dr. Eli David, one of the leading researchers in the field of computational intelligence, presented at the RE.WORK Deep Learning Summit in San Francisco this past January. Below are excerpts from an interview held for the event by Nikita Johnson, Founder of RE.WORK.
What are the key factors that have enabled recent advancements in deep learning?
The key drivers are a combination of algorithmic and hardware improvements. Thanks to numerous recent advances in methods for understanding and training deep neural networks, we are now capable of training deep networks with tens of hidden layers. Fast GPUs reduce the training time for such networks by a factor of several tens. The combination of these two factors enables us to train large neural nets with billions of synapses. I have been teaching neural networks for the past decade, and yet, until a few years ago I would have considered training such huge networks inconceivable.
How are you currently applying deep learning to cybersecurity?
The most significant improvement that deep learning provides over classical machine learning is that it removes the need for feature engineering. If you want to use classical machine learning for computer vision, you need image processing experts to tell you which few (tens or hundreds of) features in an image matter most. With deep learning, you simply feed in the raw pixels, without worrying much about image processing or feature extraction, and you typically obtain a 20-30 percent improvement in accuracy on most computer vision benchmarks.
The same holds true for cybersecurity, and specifically for the detection of new malware and APTs (advanced persistent threats, the most advanced malware). Since signature-based methods are completely incapable of detecting new malware, current solutions for detecting new malware rely mostly on manually tuned heuristics. Some more advanced solutions use manually selected features, which are then fed into classical machine learning modules. Additionally, most current methods rely on running the malware in a sandbox environment to obtain more information about it, allowing more accurate detection. This comes at the cost of providing “detection” only, rather than “prevention”, since sandboxing is typically a very time-consuming process.
At Deep Instinct we do not try to craft manual heuristics, and we do not care about finding clever features either. Rather, much as deep learning is applied in other domains, we feed datasets of many millions of malicious and legitimate files into our deep learning infrastructure (developed in-house so that deep networks can run efficiently even where resources are limited, such as on mobile devices). That is, we rely on deep learning to learn for itself the high-level non-linear features necessary for accurate classification. Additionally, we look only at the static level of files and do not use any sandboxing. This allows us to provide real-time detection and prevention (only a few milliseconds for prediction, even on the slowest mobile devices), all the while running the deep learning module in prediction mode on the device (i.e., connectionless, without the need to send the suspected file to a cloud or appliance). Finally, due to the input-agnostic nature of deep learning, we are capable of detecting any malicious file type (e.g., EXE, DLL, PDF, DOC, Android APK, etc.), since we do not depend on the specific structures of different file types.
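To make the "raw bytes in, prediction out" idea concrete, here is a minimal sketch of such a static pipeline. All names, the byte-histogram featurization, and the tiny untrained network are hypothetical illustrations of the data flow, not Deep Instinct's actual system:

```python
import numpy as np

def bytes_to_features(blob: bytes, dim: int = 256) -> np.ndarray:
    """Map a raw file to a fixed-length vector -- here a normalized
    byte-value histogram, a hypothetical stand-in for whatever static
    representation a real system would use."""
    counts = np.bincount(np.frombuffer(blob, dtype=np.uint8), minlength=dim)
    return counts / max(len(blob), 1)

def predict_malicious(x: np.ndarray, w1, b1, w2, b2) -> float:
    """Tiny feed-forward net in prediction mode: one ReLU hidden layer,
    sigmoid output read as P(malicious)."""
    h = np.maximum(0.0, x @ w1 + b1)          # hidden layer
    z = h @ w2 + b2
    return float(1.0 / (1.0 + np.exp(-z)))    # sigmoid

# Toy random (untrained) weights, just to show the shapes and data flow.
rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(256, 16)) * 0.1, np.zeros(16)
w2, b2 = rng.normal(size=16) * 0.1, 0.0

score = predict_malicious(bytes_to_features(b"MZ\x90\x00" * 64), w1, b1, w2, b2)
```

Note that prediction is a single forward pass over static bytes, with no execution or sandboxing of the file, which is what makes the on-device, connectionless deployment described above plausible.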
What are the main problems being solved by the application of deep learning to cybersecurity?
Assume that I show you a clear image of a dog, and then I modify just a few percent of the pixels in the image. You will still easily recognize that there is a dog in the image. A cybersecurity solution cannot do the same. Many hundreds of thousands of new malware samples are introduced every day, and nearly all of them are extremely small mutations of previously known malware (by some estimates, the vast majority of new malware differ by less than two percent from past malware). Yet current security solutions are completely incapable of detecting most of them. They look at the file and check its signature, which doesn’t match anything they know; the more advanced solutions then run it in sandboxing mode for many seconds or minutes, using heuristics or even classical machine learning to get an idea of whether it is malicious or legitimate (and even then the detection rates are abysmal).

A vivid illustration of this weakness is a live demonstration we usually present, in which we take several well-known malicious files and let the leading cybersecurity solutions provide their predictions on them. Most of the solutions detect all of the files (since they had ample time to manually sign them). Next, we run an automatic “mutator” we have developed, which modifies some non-functional parts of the malicious files (things such as strings, not any functional part of the program) to create new mutations that retain the same malicious behavior as the originals. Once the other cybersecurity solutions are run on the mutated set of malware, they are mostly incapable of detecting even a single one of them.
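The idea that rewriting a non-functional string defeats signature matching can be sketched in a few lines. This toy example (entirely hypothetical, and in no way Deep Instinct's actual mutator) swaps one embedded string in a stand-in byte blob and shows that the file's cryptographic hash, the basis of a signature, no longer matches, even though no "functional" byte changed:

```python
import hashlib

def mutate_string(blob: bytes, old: bytes, new: bytes) -> bytes:
    """Toy 'mutator': rewrite one non-functional embedded string while
    leaving every other byte untouched. Same-length replacement keeps
    all offsets stable."""
    assert len(old) == len(new), "replacement must keep offsets stable"
    return blob.replace(old, new)

# Stand-in "file": fake header, fake code bytes, and an embedded string.
original = b"MZ\x90\x00" + b"\xde\xad\xbe\xef" * 8 + b"\x00Copyright Acme 2015\x00"
mutant = mutate_string(original, b"Copyright Acme 2015", b"Cqpyrighz Acme 2016")

sig_original = hashlib.sha256(original).hexdigest()
sig_mutant = hashlib.sha256(mutant).hexdigest()
# The two signatures differ, so a signature database built from
# `original` would never match `mutant`.
```

A scanner that memorized the hash of the original sample is blind to the mutant, while a detector trained on invariant features of the content would still see essentially the same file.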
Deep learning creates high-level invariant representations of objects and is resilient to changes in the raw data (e.g., the high-level representation of an elephant remains largely unchanged even if the raw pixels of the image show completely different views of the elephant). Using the same principle in cybersecurity, we train deep networks on datasets of millions of files so that invariant representations are learned. Thus, when our deep net is given a mutated or modified file in the real world, it easily recognizes its true nature (malicious or legitimate). To further encourage this invariance, during training we intentionally modify the dataset in each epoch (introducing mutations to tens of percent of each file). This makes the training phase much more difficult, of course, but the resulting trained network is highly resilient to changes and mutations and detects nearly all new malware variants.
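The per-epoch mutation step described above is essentially data augmentation over raw bytes. A minimal sketch, with a hypothetical mutation rate and a plain byte sequence standing in for a training file:

```python
import random

def mutate_sample(blob: bytes, rate: float = 0.3, seed=None) -> bytes:
    """Per-epoch augmentation sketch: randomly overwrite roughly `rate`
    of the bytes so the network must learn features that survive
    mutation. The 30 percent rate here is an illustrative assumption."""
    rng = random.Random(seed)
    out = bytearray(blob)
    for i in range(len(out)):
        if rng.random() < rate:
            out[i] = rng.randrange(256)  # replace with a random byte
    return bytes(out)

sample = bytes(range(256)) * 4  # 1024-byte stand-in for a training file
# A fresh mutated view of the same sample in each training epoch.
epoch_views = [mutate_sample(sample, rate=0.3, seed=epoch) for epoch in range(3)]
```

Because every epoch sees a differently mutated copy while the label stays the same, the optimizer is pushed toward representations that ignore the mutated bytes, which is the invariance property the answer describes.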
On a typical benchmark of new malware and advanced APTs, Deep Instinct obtains a detection rate of around 99 percent. We constantly compare our results with those of over sixty different cybersecurity solutions, including some that use advanced heuristics and classical machine learning. On these new-malware benchmarks, the best solutions after us obtain below 80 percent detection, with most of them yielding a 20-40 percent detection rate. This roughly 20-percentage-point improvement we observe in cybersecurity is fairly consistent with the gains deep learning has achieved in other fields (e.g., on ImageNet the classification error dropped from 25 percent without deep learning to under four percent with it).
Another major advantage of applying deep learning is the real-time prediction it provides: it takes just a few milliseconds to feed in a raw file and pass it through the deep neural network to obtain the prediction. This allows us to provide not only detection but also prevention in all cases (the moment a malicious file is detected, it is removed as well). Our brain works in a similar way; it takes us a long time to learn something, but once we learn it, we can use it very quickly in prediction mode. Other cybersecurity solutions that attempt to detect new malware usually rely on heavy sandboxing, which often provides only a post-mortem detection. As a side note, this also explains the name of our company, Deep Instinct: we rely on deep learning, and the reaction times are always real-time, like an instinct.
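The training-slow, prediction-fast asymmetry is easy to see by timing a forward pass directly. This sketch times a single pass through a toy one-hidden-layer network (all sizes and weights are illustrative assumptions; a real deployed network would be far larger, but the cost is still a fixed sequence of multiply-adds):

```python
import math
import random
import time

random.seed(1)
x = [random.random() for _ in range(256)]                       # stand-in input features
w = [[random.gauss(0, 0.1) for _ in range(16)] for _ in range(256)]  # hidden weights
v = [random.gauss(0, 0.1) for _ in range(16)]                   # output weights

t0 = time.perf_counter()
# One forward pass: ReLU hidden layer, then a sigmoid output.
h = [max(0.0, sum(x[i] * w[i][j] for i in range(256))) for j in range(16)]
score = 1.0 / (1.0 + math.exp(-sum(h[j] * v[j] for j in range(16))))
elapsed_ms = (time.perf_counter() - t0) * 1000.0
```

All the expensive work (gradient descent over millions of files) happens once at training time; at prediction time only this fixed arithmetic remains, which is why per-file latency can stay in the millisecond range even on weak hardware.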
What developments can we expect to see in deep learning and cybersecurity in the next 5 years?
Over the past two years we have observed an accelerating success of deep learning in most areas to which it is applied. Even if we do not achieve the Holy Grail of human-level cognition within the next five years (though this will most probably happen in our lifetime), we will see huge improvements in many additional domains. Specifically, I think the most promising area will be unsupervised learning, as most of the data in the world is unlabeled, and our own brain’s neocortex is primarily a very good unsupervised learning box.
While Deep Instinct is the first company using deep learning for cybersecurity, I would expect more companies to employ it in the upcoming years. However, the barrier to entry for deep learning is still quite high, especially for cybersecurity companies, which do not typically use AI methods (e.g., only a few cybersecurity solutions use even classical machine learning), so it will take a few more years until deep learning becomes a commodity technology in widespread use within cybersecurity.
What advancements excite you most in the field?
I have always been fascinated by the way our brain works, and the amazing feats it is capable of accomplishing. I believe that our brain is a Turing machine (a view shared by most AI researchers), and as such there is no theoretical limitation to creating an artificial version of it that is as good as, or much better than, the “carbon-based” version. Deep learning is bringing us closer to this goal at a great and accelerating pace. I foresee many exciting breakthroughs in the upcoming years, especially in unsupervised learning. While deep learning has successfully been applied to computer vision, speech, and text understanding, there are many other challenging domains which deep learning can potentially revolutionize. Based on this belief we founded Deep Instinct, and the results we obtained have vindicated it. I expect many additional areas to be revolutionized by deep learning, especially fields which involve large amounts of complex data (e.g., finance and medicine are two prime examples of such domains).