## Overview of anomaly detection, review of multivariate Gaussian distribution, and implementation of basic anomaly detection algorithm in Python with two examples

Our innate ability to recognize patterns allows us to fill in gaps and predict what is going to happen next. Occasionally, however, something happens that does not fit our expectations and does not fall into our perception of a pattern. We call such occurrences anomalies. If we are trying to predict something, we may want to exclude anomalies from our training data. Or perhaps we want to identify anomalies to make our lives better. In either case, anomaly detection techniques prove useful and applicable across most industries and subject areas.

This article will guide you through the basics of anomaly detection and the implementation of a statistical anomaly detection model.

In general terms, anomaly detection refers to the process of identifying phenomena that are out of the ordinary. The goal of anomaly detection is to identify events, occurrences, data points, or outcomes that are not in line with our expectations and do not fit some underlying pattern. Hence, the key to implementing anomaly detection is to understand the underlying pattern of expected events. If we know the expected pattern, we can use it to map never-before-seen data points; if the mapping is not successful and a new data point falls outside the expected pattern, it is probable that we have found an anomaly.

There are three types of anomalies that typically occur. The first type includes individual instances which are considered anomalous with respect to the entire dataset (e.g., an individual car driving at very low speed on a highway is anomalous compared to all highway traffic). The second type includes instances which are anomalies within a specific context (e.g., credit card transactions which appear fine when compared to all credit card transactions but are anomalous for the specific individual's spending pattern). The third type of anomaly is collective: a set of instances may be considered anomalous even though each instance on its own follows expectations (e.g., a single fraudulent credit card transaction on Amazon may not seem out of the ordinary, but a set of transactions that take place back to back in a short amount of time is suspicious) [1].

Anomaly detection techniques fall into three categories:

**Supervised detection** requires both positive and anomalous labels in the dataset. Supervised learning algorithms such as neural networks or boosted forests can be applied to classify data points into expected/anomalous classes. Unfortunately, anomaly datasets tend to be highly imbalanced and generally do not contain enough training samples for up- or downsampling techniques to aid supervised learning.

**Semi-supervised detection** deals with partially labeled data. Semi-supervised techniques assume that the input data contains only positive instances and that it follows the expected pattern. These techniques attempt to learn the distribution of positive cases so that positive instances can be generated by the model. During testing, the algorithm evaluates the likelihood that a new instance could have been generated by the model and uses this probability to flag anomalous cases [2].

**Unsupervised detection** uses completely unlabeled data to create a boundary of expectation; anything that falls outside of this boundary is considered anomalous.

Anomaly detection techniques can be applied to any data, and the data format impacts which algorithm will be most useful. Types of data include **series** (time series, linked lists, language, sound), **tabular** (e.g., engine sensor data), **image** (e.g., X-ray images), and **graph** (e.g., workflow or process).

Given the variety of problems and techniques, anomaly detection is actually a vast area of data science with many applications. Some of these applications include: fraud detection, cybersecurity applications, analysis of sales or transactional data, identification of rare diseases, monitoring of manufacturing processes, exoplanet search, machine learning preprocessing, and many more. Therefore, access to powerful and performant algorithms has the potential to make significant impact in many fields.

Let’s take a look at the most basic algorithm that can be used to detect anomalies.

One of the basic anomaly detection techniques employs the power of Gaussian (i.e. Normal) distribution in order to identify outliers.

Discovered by Carl Friedrich Gauss, the Gaussian distribution models many natural phenomena and is, therefore, a popular choice for modeling features in a dataset. This distribution’s probability density function is a bell curve centered at the arithmetic mean, and the width of the curve is defined by the variance of the dataset. With the majority of cases at or near the center, the probability density function features an elongated tail on each end. The rarer the instance (the further it is from the center), the more likely it is to be an outlier or an anomaly. Eureka! We can use this concept to model anomalies in our dataset.

The probability density function, denoted f(x), measures the relative likelihood of some outcome x in our dataset. Formally,
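For reference, the probability density function of a Gaussian with mean μ and variance σ² can be written as:

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
```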

Let’s assume that our dataset has only one feature and that this feature follows a normal distribution; then we can model our anomaly detection algorithm using f(x) from above. We can then set some threshold epsilon that determines whether a case is anomalous. Epsilon should be set heuristically, and its value will depend on the use case and the preferred sensitivity to anomalies.

In a normal distribution, about 2.3% of instances occur more than two standard deviations below the mean (the CDF at two standard deviations below the mean is roughly 0.023), and the PDF at −2 standard deviations is about 0.054. So if we set our threshold to 0.054 and classify any point whose density falls below it as an anomaly, roughly 4.6% of events in our dataset will be flagged, since both tails of the distribution fall below this density. Lower thresholds will yield fewer classified anomalies; higher thresholds will be more sensitive and flag more points.
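We can sanity-check these numbers with scipy (a quick illustrative check, not part of the detection model itself):

```python
from scipy.stats import norm

# Density of the standard normal at two standard deviations from the mean.
pdf_at_2 = norm.pdf(2)         # ~0.054

# Probability mass more than two standard deviations below the mean.
tail_mass = norm.cdf(-2)       # ~0.023

# Flagging points where pdf < 0.054 catches BOTH tails, so the expected
# anomaly rate is roughly twice the one-sided tail mass.
both_tails = 2 * norm.cdf(-2)  # ~0.046

print(round(pdf_at_2, 3), round(tail_mass, 3), round(both_tails, 3))
```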

In the real world, there is likely to be a tradeoff: some positive cases may fall below the threshold and some anomalies may hide above it. It will be necessary to understand the use case and test different epsilon values before settling on the one that is best suited.

An example with a single feature is trivial, so what do we do if we have more than one feature? If our features are completely independent, we can take the product of the feature probability density functions in order to classify anomalies.

For the case of two uncorrelated features, this becomes
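In standard notation, with per-feature means μᵢ and standard deviations σᵢ, the product takes the form:

```latex
f(x_1, x_2) = f_1(x_1)\, f_2(x_2)
            = \frac{1}{\sigma_1\sqrt{2\pi}} \exp\!\left(-\frac{(x_1-\mu_1)^2}{2\sigma_1^2}\right)
              \cdot
              \frac{1}{\sigma_2\sqrt{2\pi}} \exp\!\left(-\frac{(x_2-\mu_2)^2}{2\sigma_2^2}\right)
```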

Essentially, the product of feature probabilities ensures that if at least one feature has an outlier value, we can detect an anomaly (given that our epsilon is high enough); if an instance exhibits outlier values in several features, the total probability will be even smaller (since it is a product of fractions) and the instance is even more likely to be an anomaly.

However, *we cannot assume that our features are independent*. This is where the multivariate probability density function comes in. In the multivariate case, we build a covariance matrix (denoted Σ) to capture how the features are related to each other. We can then use the covariance matrix to avoid “double-counting” feature relations (a very rudimentary way of phrasing what is actually happening). The formula for the multivariate probability density function is shown below, and these slides from Duke do a good job of deriving it.
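In standard notation, the multivariate Gaussian density over k features is:

```latex
f(\mathbf{x}) = \frac{1}{(2\pi)^{k/2}\, |\Sigma|^{1/2}}
\exp\!\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathsf{T}} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right)
```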

Here, x is an input vector, μ is the vector of feature means, and Σ is the covariance matrix of the features.

To make our lives easier, we can use the scipy library to implement this function: scipy.stats.multivariate_normal takes as input a vector of feature means and a covariance matrix, and has a .pdf method that returns the probability density at a given set of points.
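A minimal sketch of this API (the means and covariance values below are illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 0.0])         # vector of feature means
cov = np.array([[1.0, 0.3],
                [0.3, 1.0]])      # covariance matrix (features mildly correlated)

model = multivariate_normal(mean=mu, cov=cov)

# Density is highest at the mean; far-away points get vanishingly small densities.
print(model.pdf([0.0, 0.0]))      # high density -> expected point
print(model.pdf([4.0, -4.0]))     # near-zero density -> likely anomaly
```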

Let’s try this implementation on an actual example.

First, let’s observe a two-feature example, which will allow us to visualize anomalies in Euclidean space. For this example, I generated two features with 100 samples drawn from the Normal distribution (these are the positive samples). I calculated the feature means and covariance and fit a multivariate normal model from the scipy.stats library with this distribution information. **Of note:** I fit my model with positive samples only. In real-world data, we want to clean our dataset to ensure that the features follow a normal distribution and do not contain outliers or odd values; this will improve the model’s ability to locate anomalies (especially since it helps satisfy the Normal distribution requirement for the features). Finally, I added 5 anomalous samples to my dataset and used the .pdf method to report the probabilities.
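A sketch of the two-feature example described above; the feature locations, anomaly values, and the min-density choice of epsilon are illustrative assumptions, not the article's exact data:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# 100 positive samples for two normally distributed features.
X = rng.normal(loc=[10.0, 50.0], scale=[1.0, 5.0], size=(100, 2))

# Fit the model on positive samples only: estimate means and covariance.
mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False)
model = multivariate_normal(mean=mu, cov=cov)

# 5 anomalous samples, several standard deviations from the bulk of the data.
anomalies = np.array([[15.0, 30.0], [5.0, 70.0], [14.0, 75.0],
                      [4.0, 30.0], [16.0, 80.0]])

probs_normal = model.pdf(X)
probs_anom = model.pdf(anomalies)

# A heuristic epsilon: anything below the smallest positive-sample density.
epsilon = probs_normal.min()
print((probs_anom < epsilon).sum())  # how many anomalies get flagged
```

Because the anomalies sit far from the fitted distribution, their densities fall well below those of every positive sample, so this simple threshold flags all of them.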