A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data.

Goldstein M, Uchida S - PLoS ONE (2016)

Bottom Line: Dozens of algorithms have been proposed in this area, but the research community still lacks a comparative universal evaluation as well as common publicly available datasets. Additionally, this evaluation reveals the strengths and weaknesses of the different approaches for the first time. In conclusion, we give advice on algorithm selection for typical real-world tasks.

View Article: PubMed Central - PubMed

Affiliation: Center for Co-Evolutional Social System Innovation, Kyushu University, Fukuoka, Japan.

ABSTRACT
Anomaly detection is the process of identifying unexpected items or events in datasets that differ from the norm. In contrast to standard classification tasks, anomaly detection is often applied to unlabeled data, taking only the internal structure of the dataset into account. This challenge is known as unsupervised anomaly detection and is addressed in many practical applications, for example in network intrusion detection, fraud detection, as well as in the life science and medical domain. Dozens of algorithms have been proposed in this area, but the research community still lacks a comparative universal evaluation as well as common publicly available datasets. These shortcomings are addressed in this study, where 19 different unsupervised anomaly detection algorithms are evaluated on 10 different datasets from multiple application domains. By publishing the source code and the datasets, this paper aims to be a new well-founded basis for unsupervised anomaly detection research. Additionally, this evaluation reveals the strengths and weaknesses of the different approaches for the first time. Besides anomaly detection performance, we outline the computational effort, the impact of parameter settings, and the global/local anomaly detection behavior. In conclusion, we give advice on algorithm selection for typical real-world tasks.


pone.0152173.g007: A visualization of the results for the uCBLOF algorithm. The anomaly score is represented by the bubble size, whereas the color corresponds to the clustering result of the preceding k-means algorithm. Local anomalies are clearly not detected by uCBLOF.

Mentions: All previous anomaly detection algorithms are based on density estimation using nearest neighbors. In contrast, the cluster-based local outlier factor (CBLOF) [49] uses clustering to determine dense areas in the data and then performs a density estimation for each cluster. In theory, any clustering algorithm can be used in the first step. In practice, however, k-means is commonly used to take advantage of its low computational complexity, which is linear in contrast to the quadratic complexity of the nearest-neighbor search. After clustering, CBLOF uses a heuristic to classify the resulting clusters into large and small clusters. Finally, an anomaly score is computed as the distance of each instance to its cluster center, multiplied by the number of instances belonging to that cluster. For instances in small clusters, the distance to the closest large cluster is used instead. According to the authors, using the number of cluster members as a scaling factor should estimate the local density of the clusters. We showed in previous work that this assumption does not hold [50] and might even result in an incorrect density estimation. Therefore, we additionally evaluate a modified version of CBLOF that simply omits the weighting, referred to as unweighted-CBLOF (uCBLOF) in the following. The results of uCBLOF on a simple two-dimensional dataset are visualized in Fig 7, where the color corresponds to the clustering result of the preceding k-means algorithm. As with the nearest-neighbor based algorithms, the number of initial clusters k is a critical parameter. Here, we follow the same strategy as for the nearest-neighbor based algorithms and evaluate many different values of k. Furthermore, k-means is a non-deterministic algorithm, so the resulting anomaly scores can differ across runs. To this end, we follow a common strategy: apply k-means many times on the data and pick the most stable result. Clustering-based anomaly detection algorithms nevertheless remain very sensitive to the parameter k, since adding just a single additional centroid can lead to a very different outcome.
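The scoring scheme described above can be sketched in a few lines of Python. This is an illustrative re-implementation, not the authors' released code: the large/small cluster split uses a simple coverage-ratio heuristic (parameter alpha), standing in for the heuristic of the original CBLOF paper, and the "most stable result" across k-means runs is approximated by keeping the lowest-inertia solution.

```python
import numpy as np

def _kmeans(X, k, rng, iters=100):
    """Plain Lloyd's k-means; returns labels, centroids, and inertia."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    inertia = float(((X - centers[labels]) ** 2).sum())
    return labels, centers, inertia

def ucblof_scores(X, k=3, alpha=0.9, n_runs=10, seed=0):
    """uCBLOF: score = distance to the own (large) cluster centroid; points
    in small clusters are scored by the distance to the closest large
    centroid. Unlike CBLOF, the score is NOT weighted by cluster size."""
    rng = np.random.RandomState(seed)
    # k-means is non-deterministic: run it several times and keep the
    # lowest-inertia solution (a stand-in for the "most stable" result)
    labels, centers, _ = min((_kmeans(X, k, rng) for _ in range(n_runs)),
                             key=lambda r: r[2])
    sizes = np.bincount(labels, minlength=k)
    # heuristic large/small split: the biggest clusters that together
    # cover at least a fraction alpha of the data count as "large"
    order = np.argsort(sizes)[::-1]
    covered = np.cumsum(sizes[order]) / len(X)
    large = order[: np.searchsorted(covered, alpha) + 1]
    large_centers = centers[large]

    scores = np.empty(len(X))
    for i in range(len(X)):
        if labels[i] in large:
            scores[i] = np.linalg.norm(X[i] - centers[labels[i]])
        else:
            scores[i] = np.linalg.norm(large_centers - X[i], axis=1).min()
    return scores
```

Higher scores mark more anomalous instances; the CBLOF weighting discussed above would simply multiply each score by the size of the corresponding cluster.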

