Large-Scale Data Analytics Using Ensemble Clustering

Hahmann, Martin; Habich, Dirk; Lehner, Wolfgang

doi:10.1007/978-1-4614-1415-5_11

Abstract

Data clustering is a widely used analysis technique in many application domains. From the end user’s perspective, the wide variety of available algorithms and their technical parameterization make it difficult to obtain a satisfying clustering result. To overcome this issue in the context of large-scale analysis, we developed a novel feedback-driven clustering process. Aside from presenting the theoretical concepts, we also describe the infrastructure we developed to handle ever-increasing data volumes efficiently within this process.

Author information

Authors and Affiliations

Database Technology Group, Dresden University of Technology, Dresden, Germany
Martin Hahmann, Dirk Habich & Wolfgang Lehner

Corresponding author

Correspondence to Martin Hahmann.

Editor information

Editors and Affiliations

Dept. of Computer Science & Engineering, Florida Atlantic University, Boca Raton, 33431, Florida, USA
Borko Furht
LexisNexis, Boca Raton, 33487, Florida, USA
Armando Escalante

About this chapter

Cite this chapter

Hahmann, M., Habich, D., Lehner, W. (2011). Large-Scale Data Analytics Using Ensemble Clustering. In: Furht, B., Escalante, A. (eds) Handbook of Data Intensive Computing. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-1415-5_11

DOI: https://doi.org/10.1007/978-1-4614-1415-5_11
Published: 11 November 2011
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-1414-8
Online ISBN: 978-1-4614-1415-5
eBook Packages: Computer Science, Computer Science (R0)
