Contrary to what you may have read, machine learning (ML) isn’t magic pixie dust. In general, ML is good for narrowly scoped problems where huge datasets are available and where the patterns of interest are highly repeatable or predictable. Most security problems neither require nor benefit from ML. Many experts, including the folks at Google, suggest that when solving a complex problem you should exhaust all other approaches before trying ML.
ML is a broad collection of statistical techniques that allows us to train a computer to estimate an answer to a question even when we haven’t explicitly coded the correct answer. A well-designed ML system applied to the right type of problem can unlock insights that would not have been attainable otherwise.
A successful ML example is natural language processing (NLP). NLP allows computers to “understand” human language, including things like idioms and metaphors. In many ways, cybersecurity faces the same challenges as language processing. Attackers may not use idioms, but many techniques are analogous to homonyms, words that have the same spelling or pronunciation but different meanings. Some attacker techniques likewise closely resemble actions a system administrator might take for perfectly benign reasons.
IT environments vary across organizations in purpose, architecture, prioritization, and risk tolerance. It’s impossible to create algorithms, ML or otherwise, that broadly address security use cases in all scenarios. This is why most successful applications of ML in security combine multiple methods to address a very specific issue. Good examples include spam filters, DDoS or bot mitigation, and malware detection.
The biggest challenge in ML is availability of relevant, usable data to solve your problem. For supervised ML, you need a large, correctly labeled dataset. To build a model that identifies cat photos, for example, you train the model on many photos of cats labeled “cat” and many photos of things that aren’t cats labeled “not cat.” If you don’t have enough photos or they’re poorly labeled, your model won’t work well.
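The cat example can be sketched in a few lines of Python. This is a minimal illustration, not how production image classifiers work: it assumes each photo has already been reduced to two made-up numeric features, and it uses a simple nearest-centroid rule as a stand-in for a real model. The function names and feature values are hypothetical.

```python
from math import dist
from statistics import mean


def train_centroids(samples):
    """Average the feature vectors for each label.

    samples: list of (feature_vector, label) pairs.
    Returns a dict mapping each label to its centroid.
    """
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*vectors))
        for label, vectors in by_label.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the given features."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Hypothetical features per photo, invented purely for illustration.
training = [
    ((0.90, 0.80), "cat"), ((0.80, 0.90), "cat"), ((0.85, 0.70), "cat"),
    ((0.10, 0.20), "not cat"), ((0.20, 0.10), "not cat"), ((0.15, 0.30), "not cat"),
]

model = train_centroids(training)
print(predict(model, (0.70, 0.75)))  # a photo the model has never seen
```

Everything the model "knows" comes from the labeled pairs in `training`. With too few examples, or with labels that are wrong, the centroids shift and the predictions degrade accordingly.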
In security, a well-known supervised ML use case is signatureless malware detection. Many endpoint protection platform (EPP) vendors collect huge quantities of samples labeled as malicious or benign and train a model on “what malware looks like.” These models can correctly identify evasive mutating malware and other trickery where a file is altered enough to dodge a signature but remains malicious. ML doesn’t match the signature. It predicts malice using another feature set and can often catch malware that signature-based methods miss.
However, because ML models are probabilistic, there’s a trade-off. ML can catch malware that signatures miss, but it may also miss malware that signatures catch. This is why modern EPP tools use hybrid methods that combine ML and signature-based techniques for optimal coverage.
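A rough sketch of the hybrid idea follows. It is an assumption-laden toy, not any vendor's implementation: the signature set holds a single made-up SHA-256 digest, and `ml_score` is a placeholder heuristic (the fraction of non-printable bytes) standing in for a trained model. All names and the 0.5 threshold are hypothetical.

```python
import hashlib

# Hypothetical signature database: SHA-256 digests of known-bad files.
KNOWN_BAD_SHA256 = {hashlib.sha256(b"known malicious payload").hexdigest()}


def ml_score(data: bytes) -> float:
    """Placeholder for a trained model's probability of malice.

    This toy heuristic returns the fraction of non-printable bytes.
    It is NOT a real detector; it only marks where a model would plug in.
    """
    if not data:
        return 0.0
    non_printable = sum(1 for b in data if b < 32 or b > 126)
    return non_printable / len(data)


def classify(data: bytes, threshold: float = 0.5) -> str:
    """Check the exact-match signature first, then fall back to the model."""
    if hashlib.sha256(data).hexdigest() in KNOWN_BAD_SHA256:
        return "malicious (signature)"
    if ml_score(data) >= threshold:
        return "suspicious (model)"
    return "no detection"


print(classify(b"known malicious payload"))   # exact match on the signature
print(classify(b"known malicious payloaD"))   # one byte changed: signature misses
print(classify(bytes([0, 1, 2, 200, 65])))    # no signature, but the score is high
```

The second call shows why signatures alone are brittle: a one-byte change produces a different digest. The signature path gives a deterministic answer, while the model path gives a score that depends on a threshold, which is where the probabilistic trade-off comes in.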
Even if the model is well-crafted, ML presents some additional challenges when it comes to interpreting the output, including:
Beyond the pros and cons of ML, there’s one more catch: Not all “ML” is really ML. Statistics gives you some conclusions about your data. ML makes predictions about data you didn’t have based on data you did have. Marketers have enthusiastically latched onto “machine learning” and “artificial intelligence” to signal a modern, innovative, advanced technology product of some kind. However, there’s often very little regard for whether the tech even uses ML, never mind if ML was the right approach.
So, Can ML Detect Evil or Not?
ML can detect evil when “evil” is well-defined and narrowly scoped. It can also detect deviations from expected behavior in highly predictable systems. The more stable the environment, the more likely ML is to correctly identify anomalies. But not every anomaly is malicious, and the operator isn’t always equipped with enough context to respond. ML’s superpower is not in replacing but in extending the capabilities of existing methods, systems, and teams for optimal coverage and efficiency.
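To make the point about stable environments concrete, here is a small sketch. It uses a plain statistical baseline (a z-score against the mean and standard deviation) rather than a trained model, and the hourly login counts are invented for illustration. The function name and the threshold of 3.0 are assumptions.

```python
from statistics import mean, stdev


def find_anomalies(baseline, observations, z_threshold=3.0):
    """Return observations more than z_threshold standard deviations
    away from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: anything different is a deviation.
        return [x for x in observations if x != mu]
    return [x for x in observations if abs(x - mu) / sigma > z_threshold]


# Hypothetical hourly login counts for two environments.
stable_logins = [100, 102, 98, 101, 99, 100, 103, 97]
noisy_logins = [100, 250, 20, 180, 60, 300, 10, 90]

today = [101, 140, 99]

print(find_anomalies(stable_logins, today))  # 140 stands out against a tight baseline
print(find_anomalies(noisy_logins, today))   # the same value is lost in the noise
```

The same observation is flagged in the stable environment and ignored in the noisy one. Even when it is flagged, the code only says that 140 is unusual; it says nothing about whether the cause is malicious, which is the context an operator still has to supply.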