A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data.

Goldstein M, Uchida S - PLoS ONE (2016)

Bottom Line: Dozens of algorithms have been proposed in this area, but unfortunately the research community still lacks a comparative universal evaluation as well as common publicly available datasets. Additionally, this evaluation reveals the strengths and weaknesses of the different approaches for the first time. In conclusion, we give advice on algorithm selection for typical real-world tasks.


Affiliation: Center for Co-Evolutional Social System Innovation, Kyushu University, Fukuoka, Japan.

ABSTRACT
Anomaly detection is the process of identifying unexpected items or events in datasets which differ from the norm. In contrast to standard classification tasks, anomaly detection is often applied to unlabeled data, taking only the internal structure of the dataset into account. This challenge is known as unsupervised anomaly detection and is addressed in many practical applications, for example in network intrusion detection and fraud detection, as well as in the life science and medical domains. Dozens of algorithms have been proposed in this area, but unfortunately the research community still lacks a comparative universal evaluation as well as common publicly available datasets. These shortcomings are addressed in this study, where 19 different unsupervised anomaly detection algorithms are evaluated on 10 different datasets from multiple application domains. By publishing the source code and the datasets, this paper aims to provide a new, well-founded basis for unsupervised anomaly detection research. Additionally, this evaluation reveals the strengths and weaknesses of the different approaches for the first time. Besides anomaly detection performance, the computational effort, the impact of parameter settings, and the global/local anomaly detection behavior are outlined. In conclusion, we give advice on algorithm selection for typical real-world tasks.


Fig 5 (pone.0152173.g005). Comparing COF (top) with LOF (bottom) using a simple dataset with a linear correlation between two attributes. The spherical density estimation of LOF fails to recognize the anomaly, whereas COF detects the non-linear anomaly (k = 4).

Mentions: The connectivity-based outlier factor (COF) [44] is similar to LOF, but the density estimation for the records is performed differently. In LOF, the k-nearest neighbors are selected based on the Euclidean distance. This indirectly assumes that the data is distributed spherically around the instance. If this assumption is violated, for example if features have a direct linear correlation, the density estimation is incorrect. COF compensates for this shortcoming by estimating the local density of the neighborhood using a shortest-path approach called the chaining distance. Mathematically, this chaining distance is the minimum of the sum of all distances connecting all k neighbors and the instance. For simple examples where features are obviously correlated, this density estimation approach is much more accurate [29]. Fig 5 shows the outcome of LOF and COF in direct comparison on a simple two-dimensional dataset in which the attributes have a linear dependency. It can be seen that the spherical density estimation of LOF cannot detect the outlier, whereas COF succeeds by connecting the normal records with each other in order to estimate the local density.
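To make the chaining-distance idea more concrete, the following minimal Python sketch (our own illustration, not the authors' reference implementation) builds a toy dataset with two linearly correlated attributes and contrasts a simple spherical k-NN density proxy with a greedy approximation of the chaining distance described above. The dataset, variable names, and the greedy connection strategy are assumptions made purely for illustration.

```python
# Minimal sketch (assumption: a greedy Prim-style chain approximates COF's
# set-based nearest path; this is NOT the authors' reference implementation).
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)

# Toy data: normal records lie close to the line y = 2x; one record sits off it.
x = np.linspace(0.0, 10.0, 40)
normal = np.column_stack([x, 2.0 * x + rng.normal(0.0, 0.05, size=x.size)])
anomaly = np.array([[5.0, 7.0]])               # clearly off the linear structure
X = np.vstack([normal, anomaly])

k = 4
D = cdist(X, X)                                # pairwise Euclidean distances


def chaining_distance(i, dist, k):
    """Greedily connect record i with its k nearest neighbors: repeatedly
    attach the remaining neighbor closest to the already connected set and
    sum the edge lengths (a simple stand-in for the chaining distance)."""
    neighbors = list(np.argsort(dist[i])[1:k + 1])
    connected = [i]
    total = 0.0
    while neighbors:
        d_min, r_min = min((min(dist[r][c] for c in connected), r)
                           for r in neighbors)
        total += d_min
        connected.append(r_min)
        neighbors.remove(r_min)
    return total


# Spherical (LOF-like) density proxy: mean Euclidean distance to the k-NN.
knn_avg = np.sort(D, axis=1)[:, 1:k + 1].mean(axis=1)
chain = np.array([chaining_distance(i, D, k) for i in range(len(X))])


def relative_score(values, dist, k):
    """Score of a record relative to the mean value of its k neighbors,
    mimicking the local comparison both LOF and COF perform."""
    nb = np.argsort(dist, axis=1)[:, 1:k + 1]
    return values / values[nb].mean(axis=1)


print("spherical k-NN score of the off-line record:",
      relative_score(knn_avg, D, k)[-1])
print("chaining-distance score of the off-line record:",
      relative_score(chain, D, k)[-1])
```

Note that in the actual COF formulation the chaining distance is a weighted average along the set-based nearest path rather than the plain sum used here, so this sketch should be read as an intuition aid rather than a drop-in COF scorer.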

