Large-Scale Data Analytics Using Ensemble Clustering

Abstract

Data clustering is a widely used analysis technique in many application domains. From the end user’s perspective, the wide variety of available algorithms and their technical parameterization make it difficult to obtain a satisfying clustering result. To overcome this issue in the context of large-scale analysis, we developed a novel feedback-driven clustering process. In addition to presenting the theoretical concepts, we describe the infrastructure we developed to handle ever-increasing data volumes efficiently within this process.

Author information

Corresponding author

Correspondence to Martin Hahmann.

Copyright information

© 2011 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Hahmann, M., Habich, D., Lehner, W. (2011). Large-Scale Data Analytics Using Ensemble Clustering. In: Furht, B., Escalante, A. (eds) Handbook of Data Intensive Computing. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-1415-5_11

  • DOI: https://doi.org/10.1007/978-1-4614-1415-5_11

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4614-1414-8

  • Online ISBN: 978-1-4614-1415-5

  • eBook Packages: Computer Science (R0)
