Recent Advances in High-Dimensional Clustering for Text Data

Chapter in the book Claudio Moraga: A Passion for Multi-Valued Logic and Soft Computing

Part of the book series: Studies in Fuzziness and Soft Computing (STUDFUZZ, volume 349)

Abstract

Clustering has become an important tool for data scientists because it supports exploratory data analysis and the summarization of large amounts of data. For text data specifically, clustering faces additional challenges that stem from the high-dimensional space in which the data is represented. Furthermore, despite the important contributions already made, scalability remains a major challenge once the whole-data-in-memory approach is no longer viable, as in real scenarios where data is collected in massive volumes. This chapter reviews recent contributions on high-dimensional text data clustering, with particular emphasis on scalability issues and on the impact of the curse of dimensionality on distance-based clustering methods.

Acknowledgments

It is a great honor for me to take part in this homage book for a person whom I sincerely admire. I met Professor Moraga in 2010 through a seminar course that he gave for doctoral students at the Universidad Técnica Federico Santa María. Afterwards I came to appreciate his human warmth and constant eagerness to help others (me among them). I am quite sure he would not be comfortable with a long acknowledgement, so I just want to express my gratitude for his advice, our discussions and his great willingness to help. Finally, I wanted to include a photo that is very important to me, taken at my doctoral defense. Professor Moraga served on the committee; his observations were invaluable and made my thesis a much better work.

Author information

Corresponding author

Correspondence to Juan Zamora.

Copyright information

© 2017 Springer International Publishing AG

About this chapter

Cite this chapter

Zamora, J. (2017). Recent Advances in High-Dimensional Clustering for Text Data. In: Seising, R., Allende-Cid, H. (eds) Claudio Moraga: A Passion for Multi-Valued Logic and Soft Computing. Studies in Fuzziness and Soft Computing, vol 349. Springer, Cham. https://doi.org/10.1007/978-3-319-48317-7_20

  • DOI: https://doi.org/10.1007/978-3-319-48317-7_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-48316-0

  • Online ISBN: 978-3-319-48317-7

  • eBook Packages: Engineering, Engineering (R0)
