Recent Advances in High-Dimensional Clustering for Text Data

Zamora, Juan

doi:10.1007/978-3-319-48317-7_20

Part of the book series: Studies in Fuzziness and Soft Computing (STUDFUZZ, volume 349)

Abstract

Clustering has become an important tool for data scientists because it supports exploratory data analysis and the summarization of large amounts of data. For text data specifically, clustering faces additional challenges that stem from the high-dimensional space in which the data is represented. Furthermore, although important contributions have already been made, scalability remains a major challenge: the whole-data-in-memory approach is no longer valid in real scenarios where data is collected in massive volumes. This chapter reviews recent contributions on high-dimensional text data clustering, with particular emphasis on scalability issues and on the impact of the curse of dimensionality on distance-based clustering methods.
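As an informal illustration of why the curse of dimensionality matters for distance-based methods, the following minimal sketch measures how the gap between the nearest and the farthest neighbor of a query shrinks as dimensionality grows. It is not taken from the chapter: the function name `relative_contrast`, the uniform random data, and the chosen dimensions are assumptions made only for this example.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for Euclidean distances between one
    random query and n_points random points, all uniform in [0, 1]^dim.

    Hypothetical helper for illustration only; synthetic uniform data is
    an assumption, not a model of real text collections.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


if __name__ == "__main__":
    # The contrast tends to shrink as the dimension grows, so "near" and
    # "far" points become harder to tell apart by distance alone.
    for dim in (2, 10, 100, 1000):
        print(dim, round(relative_contrast(dim), 3))
```

Under these assumptions, the printed contrast should fall sharply between 2 and 1000 dimensions, which is the kind of effect that weakens nearest-neighbor and centroid-based decisions in high-dimensional representations.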

Acknowledgments

It is a great honor for me to be part of this homage book for a person whom I sincerely admire. I met Professor Moraga in 2010 through a seminar course that he gave for doctoral students at the Universidad Técnica Federico Santa María. From then on I came to appreciate his human warmth and his constant eagerness to help others, me among them. I am quite sure he would not be comfortable with a long acknowledgment, so I will simply express my gratitude for his advice, our discussions, and his great willingness to help. Finally, I wanted to include a photo that is very important to me, taken at my doctoral defense. Professor Moraga served on the committee; his observations were invaluable and made my thesis a much better work.

Author information

Authors and Affiliations

Universidad de Valparaíso, Valparaíso, Chile
Juan Zamora

Corresponding author

Correspondence to Juan Zamora.

Editor information

Editors and Affiliations

Escuela de Ingeniería Informática, Pontificia Universidad Católica de Valparaíso, Valparaíso, Chile
Héctor Allende-Cid
Geschichte der Nat, "Ernst-Haeckel-Haus", Deutsches Museum, Munich, Germany
Rudolf Seising

About this chapter

Cite this chapter

Zamora, J. (2017). Recent Advances in High-Dimensional Clustering for Text Data. In: Seising, R., Allende-Cid, H. (eds) Claudio Moraga: A Passion for Multi-Valued Logic and Soft Computing. Studies in Fuzziness and Soft Computing, vol 349. Springer, Cham. https://doi.org/10.1007/978-3-319-48317-7_20

DOI: https://doi.org/10.1007/978-3-319-48317-7_20
Published: 21 October 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-48316-0
Online ISBN: 978-3-319-48317-7
eBook Packages: Engineering, Engineering (R0)