April 24, 2020

Anomaly detection explained for beginners

Mike Khanna

Anomaly detection lets banks flag fraud, factories identify failing equipment, and sysadmins detect intrusions. But how can you leverage it without employing an army of data scientists?

Anomaly detection is an extremely powerful technique for identifying critical outliers in a set of data. It is particularly useful where you are trying to spot a rare but mission-critical event. For instance, you might want to spot unusual transactions on someone’s bank account, or identify problems with a piece of important machinery before it fails. In this blog, we look at the history and basics of anomaly detection and show how anyone can leverage it.

The origins of anomaly detection

Data science powers some of the most impressive applications of technology in the modern world. It underpins all those things branded ‘big data’, ‘machine learning’, or ‘artificial intelligence’. Over the past decades, it has given us some incredibly useful techniques for analyzing and interpreting data. In turn, these techniques allow us to construct powerful machine learning models. You can then use such models to power AI systems to solve important real-world problems.

Some key data science techniques

Data science is primarily about making usable observations about large sets of data. By ‘usable observations’ I mean identifying features, patterns, and anomalies in the data. In turn, you can use these observations to understand and leverage the data.

Feature engineering

In data science, a feature is a specific aspect of a data set that you can quantify in some form. For instance, in a set of accounts, it could be the amount spent. Often, these features may not be so clear-cut. Or you may find that they are incomplete. Feature engineering allows you to clean up the dataset, discarding some features, combining others.

Finding patterns

Datasets often exhibit clear patterns. Sometimes, a human can easily spot these. After all, our brain is remarkably good at pattern recognition. So good, in fact, that we are prone to spotting patterns when none exists. But how do you spot patterns in enormous datasets? Data science offers us several techniques depending on how much we know about the data. For instance, if we know very little, we can turn to k-means clustering to find natural groupings in the data. Or if we know more about the data (for example, if some of it is already labeled), we can use one of the myriad forms of supervised learning to find the patterns.

Anomaly detection

An anomaly is a data point that fulfills two key criteria. Firstly, its values differ markedly from normal data values. And secondly, it only occurs very rarely in the dataset. You generally classify anomalies as either univariate or multivariate. Univariate anomalies relate to a single data feature. Multivariate anomalies exist across multiple features. While you can sometimes do it, detecting anomalies by hand is extremely hard, especially if you are working with time-series data.

The basics of anomaly detection

Anomaly detection is a whole field of data science by itself. So, all I can show you here is the very basics. First, I need to explain the different types of anomaly you can find.

Point anomalies happen when a single data point is anomalous. This is the classic outlier on a graph. In its simplest form, this is a univariate anomaly.

Contextual anomalies require knowledge of the surrounding context. That is, they may only be anomalous under some circumstances. These can be either univariate or multivariate.

Collective anomalies are more subtle. Each data point may not be anomalous, but taken together, you know something is odd.
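To make the simplest case concrete, here is a minimal sketch of spotting a point anomaly in a single feature. It uses the modified z-score, which measures how far each value sits from the median in units of the median absolute deviation. The function name, the sample amounts, and the 3.5 cutoff (a common rule of thumb) are illustrative assumptions, not part of any particular product.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Return the indices of values whose modified z-score exceeds `threshold`."""
    med = median(values)
    # Median absolute deviation: a spread measure that outliers barely affect.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread at all, so the score is undefined
    # 0.6745 rescales the MAD so the score is comparable to a normal z-score.
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical transaction amounts; one is far larger than the rest.
amounts = [12.5, 9.9, 14.2, 11.0, 13.7, 10.4, 950.0, 12.1]
flagged = mad_outliers(amounts)
print(flagged)  # prints [6]
```

The median is used instead of the mean because a single extreme value can drag the mean (and the standard deviation) far enough to hide itself.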

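To identify anomalies, you can use several techniques. One of the most common is called isolation forest (a form of unsupervised learning). Unsurprisingly (given the name), this is a tree-based method for anomaly detection. You start by picking a feature at random and choosing a random split value between that feature’s minimum and maximum. This divides the data into two partitions. Next, you recursively subdivide each partition in the same way. You repeat this process until every partition holds just one data point or all the data points in it have the same value. Anomalies are the points that end up isolated after only a few splits, because they sit far from the rest of the data. In practice, you build many such random trees and average the number of splits each point needed. This approach works for data with a single feature as well as for data with many.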
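The idea behind isolation forest can be sketched in a few lines of plain Python. This is a simplified illustration, not a full implementation: it follows one point down many random splits and averages how many splits it took to isolate it. Production libraries (for example, scikit-learn's IsolationForest) also subsample the data and normalize the score. All names and the sample data here are assumptions made for the example.

```python
import random

def isolation_depth(point, data, rng, max_depth=12):
    """Count the random splits needed to isolate `point` from the rest of `data`."""
    depth = 0
    while len(data) > 1 and depth < max_depth:
        # Only features that still vary within this partition can be split.
        features = [f for f in range(len(point))
                    if min(r[f] for r in data) != max(r[f] for r in data)]
        if not features:
            break  # the remaining rows are identical
        f = rng.choice(features)
        lo = min(r[f] for r in data)
        hi = max(r[f] for r in data)
        split = rng.uniform(lo, hi)
        # Keep only the rows that fall on the same side of the split as `point`.
        data = [r for r in data if (r[f] < split) == (point[f] < split)]
        depth += 1
    return depth

def average_depth(point, data, trials=200, seed=0):
    """Average isolation depth over many random trials (fewer splits = more anomalous)."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(trials)) / trials

# A tight 5x5 grid of "normal" points plus one point far away from it.
grid = [(x / 4, y / 4) for x in range(5) for y in range(5)]
data = grid + [(10.0, 10.0)]

depths = {p: average_depth(p, data) for p in data}
most_anomalous = min(depths, key=depths.get)
print(most_anomalous, depths[most_anomalous])
```

The far-away point is usually cut off by the very first split, while points inside the grid need several splits before they stand alone. That difference in depth is the anomaly score.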

Clustering-based methods

Many approaches for anomaly detection rely on trying to identify all clusters within the data. If you do it right, any data points that lie outside of clusters are anomalies. There are many approaches to doing this, and it is an active research field.
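As a rough sketch of the clustering idea, the example below runs a bare-bones k-means and then flags any point that lies unusually far from its nearest cluster center. The starting centers, the sample points, and the cutoff of three times the median distance are all assumptions chosen for illustration; real systems tune these or use more robust clustering methods.

```python
import math
from statistics import median

def kmeans(points, centroids, iterations=10):
    """Very small k-means: assign points to the nearest centroid, then re-average."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members (keep it if it has none).
        centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else old
            for members, old in zip(clusters, centroids)
        ]
    return centroids

def far_from_clusters(points, centroids, factor=3.0):
    """Flag points whose distance to the nearest centroid is far above the median."""
    dists = [min(math.dist(p, c) for c in centroids) for p in points]
    cutoff = factor * median(dists)
    return [p for p, d in zip(points, dists) if d > cutoff]

points = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (1.1, 1.2), (0.9, 0.9),   # cluster A
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1), (5.1, 5.2), (4.9, 4.8),   # cluster B
    (3.0, 9.0),                                                    # stray point
]
centers = kmeans(points, centroids=[(0.0, 0.0), (6.0, 6.0)])
flagged = far_from_clusters(points, centers)
print(flagged)  # prints [(3.0, 9.0)]
```

Note that the stray point still gets assigned to a cluster and pulls that cluster's center toward it. This is one reason plain k-means is only a starting point for this kind of detection.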

Anomaly detection allows you to spot anomalies that lie outside the clusters. Sonasoft NuGene makes this easy.

Density-based methods

Here, you are trying to identify how dense the data is within a given neighborhood. If you assume anomalies lie outside of dense areas, you can use this approach to spot them. This requires you to score the potential outlier based on some measure, such as its Euclidean distance to nearby points. You can use several well-known techniques, including k-nearest neighbors or local outlier factor (LOF).
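A minimal sketch of the k-nearest-neighbors flavor of this idea: score every point by its average Euclidean distance to its k closest neighbors. Points in dense areas get low scores, and isolated points get high ones. The function name, the choice of k, and the sample points are assumptions for the example. Local outlier factor goes a step further by comparing each point's density with that of its neighbors, which this sketch does not do.

```python
import math

def knn_scores(points, k=2):
    """Score each point by its mean distance to its k nearest neighbors."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (1.1, 1.2), (0.9, 0.9), (4.0, 4.0)]
scores = knn_scores(points)
suspect = points[scores.index(max(scores))]
print(suspect)  # prints (4.0, 4.0)
```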

Applications of anomaly detection

You can use anomaly detection to solve a whole range of business use cases. Let’s look at three examples.

Fraud detection

Credit and bank card fraud costs the economy billions each year. Spotting fraud is therefore big business. You have several ways to do this. For instance, you can find any point anomalies in the amount spent in a single transaction. A sudden high expenditure may indicate the card is being used fraudulently. Or you might use context-based anomaly detection. You might spot that a card is suddenly used to make a large number of transactions in a foreign country. Either the card-owner is traveling, or the card has been stolen.

Identifying machinery that’s going to fail

In heavy industry and manufacturing, you need to identify potential machine failures before they happen. For instance, in many mining operations, you are reliant on pumps working 24/7. You can identify collective anomalies to spot potential failures. For instance, if the oil pressure, temperature, and engine vibration all increase together, it might mean the oil pump is failing, even when no single reading looks alarming on its own.
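One simple way to catch readings that are only suspicious in combination is to turn each sensor reading into a z-score and add up the squares. The sketch below assumes the sensors are roughly independent and normally distributed. Under that assumption, the combined score follows a chi-square distribution, and 11.34 is approximately its 99th percentile for three sensors. The baseline figures, sensor names, and thresholds are all hypothetical.

```python
# Hypothetical baseline (mean, standard deviation) for each sensor.
BASELINE = {
    "oil_pressure": (40.0, 2.0),
    "temperature": (80.0, 5.0),
    "vibration": (1.0, 0.25),
}

SINGLE_SENSOR_LIMIT = 3.0   # flag one sensor if it is more than 3 sigma off
COMBINED_LIMIT = 11.34      # approx. 99th percentile of chi-square, 3 degrees of freedom

def z_scores(reading):
    """Express each sensor value as a number of standard deviations from its mean."""
    return {name: (reading[name] - mean) / std
            for name, (mean, std) in BASELINE.items()}

def assess(reading):
    """Return (any single sensor anomalous?, combined reading anomalous?)."""
    z = z_scores(reading)
    single = any(abs(v) > SINGLE_SENSOR_LIMIT for v in z.values())
    combined = sum(v * v for v in z.values()) > COMBINED_LIMIT
    return single, combined

# Every sensor is only 2 sigma above normal, yet all three rise together.
reading = {"oil_pressure": 44.0, "temperature": 90.0, "vibration": 1.5}
single, combined = assess(reading)
print(single, combined)  # prints False True
```

Here no individual sensor crosses its own limit, but the combined score does, which is the kind of signal a per-sensor alarm would miss.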

Intrusion detection

Another use case for collective anomalies is intrusion detection. Often, once a hacker is inside your network, they will try to copy as much data as possible. You might look for unusual patterns of data copies to identify this. Alternatively, you could use contextual anomalies to detect it. Typically, this might involve spotting that a given user is suddenly accessing data or systems they never accessed before.

How Sonasoft Saibre helps

If you want to create usable anomaly detection systems, you will find it time-consuming, hard, and expensive. Fortunately, Sonasoft Saibre is designed to solve exactly this sort of problem for you. Saibre is our industry-leading AI bot factory. At its heart lies a unified AI platform that can autonomously create machine learning models for you. If you ask a data scientist for an ML model, they can only create one at a time. But Saibre will try out dozens of different models to find the best one for the job. It then integrates this model into a proper autonomous bot that you can install within your system. You can create almost any type of bot with Saibre. Speak to us if you would like a demonstration of how Saibre can transform your mission-critical processes.
