Abstract
This paper proposes StrAP (Streaming AP), which extends Affinity Propagation (AP) to data streaming. AP, a new clustering algorithm, uses message passing to extract the data items, called exemplars, that best represent the dataset. StrAP is built in several steps. The first (Weighted AP) extends AP to weighted items with no loss of generality. The second (Hierarchical WAP) reduces the quadratic complexity of AP by applying AP to data subsets and then applying Weighted AP to the exemplars extracted from all subsets. Finally, StrAP extends Hierarchical WAP to deal with changes in the data distribution. Experiments on artificial datasets, on the Intrusion Detection benchmark (KDD99), and on a real-world problem, clustering the stream of jobs submitted to the EGEE grid system, provide a comparative validation of the approach.
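To make the complexity argument concrete, the following is a minimal sketch that counts similarity-matrix entries (and hence messages per AP iteration) for flat AP versus the two-level scheme described above. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the subsets are assumed to be of near-equal size, and `exemplars_per_subset` is an assumed fixed count, whereas AP itself determines how many exemplars emerge.

```python
def flat_pairs(n_items: int) -> int:
    """Entries of the full similarity matrix when AP runs on all items at once."""
    return n_items * n_items


def hierarchical_pairs(n_items: int, n_subsets: int, exemplars_per_subset: int) -> int:
    """Entries needed by a two-level scheme: one similarity matrix per subset,
    then one matrix over the exemplars gathered from all subsets.

    exemplars_per_subset is a hypothetical fixed count used only for this
    estimate; in practice the number of exemplars is decided by AP.
    """
    if not 1 <= n_subsets <= n_items:
        raise ValueError("n_subsets must be between 1 and n_items")
    if exemplars_per_subset < 1:
        raise ValueError("exemplars_per_subset must be at least 1")

    # Split the items into near-equal subsets (sizes differ by at most one).
    base, extra = divmod(n_items, n_subsets)
    sizes = [base + 1 if i < extra else base for i in range(n_subsets)]

    # First level: AP on each subset independently.
    first_level = sum(size * size for size in sizes)

    # Second level: Weighted AP on the exemplars from all subsets.
    # A subset cannot yield more exemplars than it has items.
    n_exemplars = sum(min(exemplars_per_subset, size) for size in sizes)
    second_level = n_exemplars * n_exemplars

    return first_level + second_level


print(flat_pairs(10_000))                  # 100000000
print(hierarchical_pairs(10_000, 100, 5))  # 1250000
```

With these assumed numbers, 100 subsets of 100 items plus a second pass over 500 exemplars need 1,250,000 entries instead of 100,000,000. The saving depends on the subset size and on how many exemplars each subset actually produces.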
© 2008 Springer-Verlag Berlin Heidelberg
Cite this paper
Zhang, X., Furtlehner, C., Sebag, M. (2008). Data Streaming with Affinity Propagation. In: Daelemans, W., Goethals, B., Morik, K. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2008. Lecture Notes in Computer Science, vol 5212. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-87481-2_41
DOI: https://doi.org/10.1007/978-3-540-87481-2_41
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-87480-5
Online ISBN: 978-3-540-87481-2
eBook Packages: Computer Science (R0)