Abstract
Data clustering is a widely used analysis technique in many application domains. From the end user's perspective, the wide variety of available algorithms and their technical parameterization make it difficult to obtain a satisfactory clustering result. To overcome this issue in the context of large-scale analysis, we developed a novel feedback-driven clustering process. In addition to presenting the theoretical concepts, we describe the infrastructure we developed to handle ever-increasing data volumes efficiently within this process.
Copyright information
© 2011 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Hahmann, M., Habich, D., Lehner, W. (2011). Large-Scale Data Analytics Using Ensemble Clustering. In: Furht, B., Escalante, A. (eds) Handbook of Data Intensive Computing. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-1415-5_11
DOI: https://doi.org/10.1007/978-1-4614-1415-5_11
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-1414-8
Online ISBN: 978-1-4614-1415-5
eBook Packages: Computer Science, Computer Science (R0)