Anomaly Detection on Streams
Anomaly detection generally refers to the process of automatically detecting events or behaviors that deviate from those considered normal. It is an unsupervised process and can thus detect anomalies that have not been previously encountered. It is based on estimating a model of typical behavior from past observations and then comparing current observations against this model. It can be performed either on a single stream or across multiple streams. Anomaly detection encompasses outlier detection as well as change detection and is therefore closely related to forecasting and clustering methods.
Anomaly detection in streams has close connections to traditional outlier detection, as well as to change detection. The former is a common and widely studied topic in statistics. The latter emerged in the context of statistical monitoring and control for continuous processes, and the widely used CUSUM algorithm was proposed as early as 1954. With the emergence of data stream management systems, anomaly detection in this setting has received significant attention, with applications in network management and intrusion detection, environmental monitoring, and surveillance, to mention a few.
Anomaly detection is closely related to outlier detection and change detection. After a review of the main ideas, the streaming case is presented.
The existing approaches to outlier detection can be broadly classified into the following categories. Typically, outlier detection relies on a model for the data, whose parameters are estimated from appropriately chosen historical data. As new observations arrive, either they are compared directly against the model and declared outliers if the fit is poor, or a second set of model parameters is estimated from the recent observations and the new observations are declared outliers if there is a statistically significant difference between the two sets of parameters.
Clustering-Based and Forecasting-Based Approaches
Many clustering and forecasting algorithms detect outliers as by-products. However, not all clustering or forecasting procedures can be easily turned into outlier detection procedures.
Distribution-Based Approaches
Methods in this category are typically found in statistics textbooks. They deploy some standard distribution model (e.g., Gaussian) and flag as outliers those objects which deviate from the model. These work well in many cases, but may be unsuitable for high-dimensional data sets, or when reasonable assumptions about the distribution of the data points cannot be made.
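As a concrete illustration of this idea (the function name, data, and the threshold k are illustrative choices, not taken from the text), a simple distribution-based detector fits a Gaussian to historical data and flags new observations that lie far from the estimated mean:

```python
import math

def gaussian_outliers(history, new_points, k=3.0):
    """Flag points lying more than k standard deviations from the
    mean estimated on historical data (Gaussian model assumption)."""
    n = len(history)
    mean = sum(history) / n
    var = sum((x - mean) ** 2 for x in history) / (n - 1)
    std = math.sqrt(var)
    return [x for x in new_points if abs(x - mean) > k * std]

# History centered near zero; one clear outlier among the new points.
history = [0.1, -0.2, 0.05, 0.3, -0.1, 0.15, -0.25, 0.0]
print(gaussian_outliers(history, [0.2, 9.0, -0.1]))  # → [9.0]
```

As noted above, such a detector breaks down when the Gaussian assumption is unreasonable or the data is high-dimensional.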
Distance-Based and Density-Based Approaches
A point in a data set is a distance-based outlier if at least a fraction β of all other points are further than r from it. This outlier definition is based on a single, global criterion determined by the parameters r and β. This can lead to problems when the data set has both dense and sparse regions. Density-based approaches aim to remedy this problem, by relying on the local density of each point’s neighborhood.
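The distance-based definition can be sketched directly. The quadratic scan below is a naive illustration on one-dimensional points with made-up data and parameter values; a practical system would use spatial indexing and higher-dimensional distances:

```python
def distance_outliers(points, r, beta):
    """A point is a distance-based outlier if at least a fraction
    beta of all other points lie further than r from it."""
    outliers = []
    for i, p in enumerate(points):
        others = [q for j, q in enumerate(points) if j != i]
        far = sum(1 for q in others if abs(q - p) > r)
        if far >= beta * len(others):
            outliers.append(p)
    return outliers

# A tight cluster plus one isolated point.
data = [1.0, 1.1, 0.9, 1.05, 10.0]
print(distance_outliers(data, r=0.5, beta=0.9))  # → [10.0]
```

Note that the single global choice of r and beta is exactly what causes trouble when dense and sparse regions coexist, which motivates the density-based variants.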
Sequential hypothesis testing and sequential change detection arose out of problems in statistical process control. Assume a collected sequence of observations, modeled as random variables X1, X2, …, Xt, …. Additionally, assume that the Xt are drawn from a distribution with parameter θ and that a test of whether the true parameter is θ0 or θ1 is desired.
The Sequential Likelihood Ratio Test (SLRT) relies on the logarithm of likelihood ratios zt := log(p(xt; θ0)/p(xt; θ1)) and tests the cumulative sum z1 + … + zt to decide upon the true parameter.
This can be extended to other settings, such as detecting changes in other distribution parameters. For example, in its simplest form, CUSUM tests for a shift in the mean by essentially applying the SLRT, assuming points independently drawn from a Gaussian distribution with known variance. Many other versions have appeared since the CUSUM test was first proposed, relaxing or modifying some of these assumptions.
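A minimal one-sided CUSUM for an upward mean shift can be sketched as follows (the slack k and decision threshold h are illustrative tuning choices; k is often set to half the shift magnitude one wants to detect):

```python
def cusum(stream, mu0, k, h):
    """One-sided CUSUM for an upward shift in the mean.
    mu0: in-control mean, k: slack parameter, h: decision threshold.
    Returns the index of the first alarm, or None if no alarm fires."""
    g = 0.0
    for t, x in enumerate(stream):
        # Accumulate evidence of a positive shift, clipped at zero.
        g = max(0.0, g + (x - mu0 - k))
        if g > h:
            return t
    return None

# The mean shifts from 0 to 2 at index 10; the alarm fires shortly after.
stream = [0.0] * 10 + [2.0] * 10
print(cusum(stream, mu0=0.0, k=0.5, h=3.0))  # → 12
```

The clipping at zero is what distinguishes CUSUM from a plain running sum: evidence against a shift is discarded, so the statistic reacts quickly once the shift actually occurs.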
In general, change detection is closely related to outlier detection; in fact, change detection may also be viewed as outlier detection along the time axis.
Limited resources. In a streaming setting, a large number of observations arrives over time and the total volume of data grows indefinitely. However, processing and storage capacity are limited in comparison to the amount of data. Therefore, data summarization or sketching techniques need to be applied in order to extract a few relevant features from the raw data.
Concept drift. In an indefinitely growing collection of observations, changes in the underlying features (e.g., distribution parameters) may not necessarily correspond to anomalies, but rather be part of normal changes in the behavior of the system. Thus, mechanisms to handle such non-stationarity or concept drift and adapt to changing behavior are necessary.
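One simple way to adapt to such drift, sketched here for a univariate stream with illustrative parameters (alpha, k, warmup are not prescribed by the text), is to keep exponentially weighted estimates that forget old behavior, flagging points far from the current estimate while excluding them from the update:

```python
def ew_zscore_detector(stream, alpha=0.1, k=3.0, warmup=10):
    """Exponentially weighted mean/variance estimates that track slow
    drift; points far from the current estimate are flagged as
    anomalies and excluded from the update. alpha controls forgetting."""
    mean, var = 0.0, 1.0
    anomalies = []
    for t, x in enumerate(stream):
        if t >= warmup and abs(x - mean) > k * (var ** 0.5):
            anomalies.append(t)  # anomaly: do not adapt to it
            continue
        diff = x - mean
        mean += alpha * diff
        var = (1 - alpha) * (var + alpha * diff * diff)
    return anomalies

# A slow upward drift is tracked as normal; the spike at index 50 is not.
stream = [0.01 * t for t in range(100)]
stream[50] = 10.0
print(ew_zscore_detector(stream))  # → [50]
```

The design choice here is the exponential forgetting factor alpha: larger values adapt faster to drift but also risk absorbing genuine anomalies into the model.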
Next, several of the approaches that have been studied in the literature are reviewed.
In the past several years, a number of techniques for sketch or synopsis construction have appeared, with applications to many stream processing problems. Some examples include CM sketches, AMS sketches, FM sketches, and Bloom filters. Other summarization techniques specifically for data clustering on streams have appeared, such as those in [1, 4], which can be easily extended for outlier detection on streams.
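For illustration, a minimal CM (Count-Min) sketch can be written as follows. The width, depth, and salting scheme are illustrative simplifications; production implementations use pairwise-independent hash families and size the table from the desired error bounds:

```python
import random

class CountMinSketch:
    """Count-Min sketch: fixed-size counter array that answers
    frequency queries with a one-sided (over-)estimate."""

    def __init__(self, width=256, depth=4, seed=0):
        rng = random.Random(seed)
        self.width = width
        # One salt per row stands in for an independent hash function.
        self.salts = [rng.getrandbits(32) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        for row, salt in enumerate(self.salts):
            yield row, hash((salt, item)) % self.width

    def update(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def query(self, item):
        # Minimum over rows: an upper bound on the true count.
        return min(self.table[row][col] for row, col in self._cells(item))

cms = CountMinSketch()
for _ in range(100):
    cms.update("10.0.0.1")
cms.update("10.0.0.2")
print(cms.query("10.0.0.1"))  # at least 100, regardless of collisions
```

Memory is fixed at width × depth counters no matter how many distinct items appear, which is precisely the property the limited-resources discussion above calls for.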
In many applications, the appearance of sudden bursts in the data often signifies an anomaly. For example, in a network monitoring application, a burst in the traffic volume to a particular destination may signify a denial-of-service (DoS) attack. Thus, burst detection on streams has received significant attention.
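A simple burst detector, for illustration, counts events in a trailing time window and raises an alarm when the count exceeds a threshold (the window length, threshold, and synthetic timestamps below are arbitrary choices):

```python
from collections import deque

def detect_bursts(timestamps, window=10.0, threshold=20):
    """Flag event times at which more than `threshold` events fall
    within the trailing `window` seconds. `timestamps` must be sorted."""
    recent = deque()
    bursts = []
    for t in timestamps:
        recent.append(t)
        # Drop events that have fallen out of the trailing window.
        while recent[0] < t - window:
            recent.popleft()
        if len(recent) > threshold:
            bursts.append(t)
    return bursts

# Background traffic of one event per second, plus a burst of
# 30 events packed into a fraction of a second around t = 50.
base = [float(t) for t in range(100)]
burst = [50.0 + 0.01 * i for i in range(30)]
print(detect_bursts(sorted(base + burst)))  # times near t = 50 are flagged
```

Fixing a single window length is the main limitation of this sketch; work such as elastic burst detection monitors many window sizes simultaneously.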
Often a collection of multiple streams is available and measurements from different streams may be highly correlated with each other. If the strength of correlations changes over time or the number of correlated components varies, this often signifies changes in the underlying data-generating process that may be due to anomalies.
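As a minimal illustration of tracking correlation strength between two streams, a sliding-window Pearson correlation can be monitored and a change flagged when it departs from its usual range (the synthetic streams and window size below are made up for the example):

```python
import math
import statistics

def rolling_correlation(x, y, window=20):
    """Pearson correlation of two equal-length streams over a
    sliding window; returns one value per window position."""
    corrs = []
    for i in range(window, len(x) + 1):
        xs, ys = x[i - window:i], y[i - window:i]
        mx, my = statistics.fmean(xs), statistics.fmean(ys)
        cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
        norm = math.sqrt(sum((a - mx) ** 2 for a in xs)
                         * sum((b - my) ** 2 for b in ys))
        corrs.append(cov / norm)
    return corrs

# Two streams that move together at first, then become
# anti-correlated halfway through.
x = [math.sin(0.3 * t) for t in range(80)]
y = x[:40] + [-v for v in x[40:]]
corrs = rolling_correlation(x, y)
print(corrs[0], corrs[-1])  # near +1.0, then near -1.0
```

A full system would maintain the window statistics incrementally rather than recomputing them, and typically tracks many stream pairs (or the principal components of all streams) at once.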
More generally, detecting significant changes has been studied in the context of stream processing.
With the widespread adoption of the internet, various forms of malware (e.g., viruses, worms, trojans, and botnets) have become a serious and costly issue. Most intrusion detection systems (IDS) rely on known signatures to identify malicious payloads or behaviors. However, there are several efforts underway for automatic detection of suspicious activity on the fly, as well as for automating the signature extraction process.
Maintenance of large computer clusters or networks is traditionally labor-intensive and contributes a large fraction of the total cost of ownership. Hence, autonomic computing initiatives aim to automate this process. An important first step is the automatic, unsupervised detection of abnormal events (e.g., node or link failures) based on continuously collected system metrics. Streaming anomaly detection methods are used to address this problem.
Applications in quality control and industrial process control have traditionally provided much of the impetus for the development of change detection methods. Machinery used in a production chain (e.g., food preparation or chip fabrication) typically monitors a large number of process parameters at each step. Early detection of sudden changes in these parameters is important for identifying potential flaws in the process, which can severely affect end-product quality.
Small and cheap sensors which can continuously monitor patient physiological data (e.g., temperature, blood pressure, heart rate, ECG measurements, glucose levels, etc.) are becoming widely available. Anomaly detection methods can prove essential in enabling early diagnosis of potentially life-threatening conditions, as well as preventive healthcare.
Early detection of faults by continuously monitoring civil infrastructure components (e.g., bridges, buildings, and roadways) can reduce maintenance costs and increase safety. Similarly, surveillance systems on urban environments rely on anomaly detection methods to spot suspicious activities and increase security.
Certain anomalies can be detected only by taking into account information collected from a large number of different sources. Even if data ownership issues are resolved, collecting all this information at a central site is often infeasible due to its large volume. A number of efforts have tackled this problem in the past few years, but much remains to be done, especially as the scale of information collected increases. Also related to this trend is anomaly detection on more complex data, such as time-evolving graphs.
- 1. Aggarwal CC, Han J, Wang J, and Yu PS. A framework for clustering evolving data streams. In: Proceedings of the 29th International Conference on Very Large Data Bases, 2003, p. 81–92.
- 5. Hulten G, Spencer L, and Domingos P. Mining time-changing data streams. In: Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001, p. 97–106.
- 7. Kleinberg J. Bursty and hierarchical structure in streams. In: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002, p. 91–101.
- 10. Papadimitriou S, Sun J, and Faloutsos C. Streaming pattern discovery in multiple time-series. In: Proceedings of the 31st International Conference on Very Large Data Bases, 2005, p. 697–708.
- 12. Wang H, Fan W, Yu PS, and Han J. Mining concept-drifting data streams using ensemble classifiers. In: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003, p. 226–35.
- 13. Zhu Y and Shasha D. StatStream: statistical monitoring of thousands of data streams in real time. In: Proceedings of the 28th International Conference on Very Large Data Bases, 2002, p. 358–69.
- 14. Zhu Y and Shasha D. Efficient elastic burst detection in data streams. In: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003, p. 336–45.