Fast Approximate Text Document Clustering Using Compressive Sampling

Park, Laurence A. F.

doi:10.1007/978-3-642-23783-6_36

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6912)

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases

Abstract

Document clustering involves repeated scanning of a document set; as the set grows, the time required for clustering increases, and the task may even become infeasible due to computational constraints. Compressive sampling is a feature sampling technique that allows us to perfectly reconstruct a vector from a small number of samples, provided that the vector is sparse in some known domain. In this article, we apply the theory behind compressive sampling to the document clustering problem using k-means clustering. We provide a method of computing high-accuracy clusters in a fraction of the time it would take to cluster the documents directly. The method uses the Discrete Fourier Transform and the Discrete Cosine Transform. We provide empirical results showing that compressive sampling gives a 14-fold increase in speed with little reduction in accuracy on 7,095 documents, and we also provide a very accurate clustering of a 231,219-document set, with a 20-fold increase in speed compared to performing k-means clustering directly on the document set. This shows that compressive clustering is a very useful tool for quickly computing approximate clusters.
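
The general idea described in the abstract can be illustrated with a minimal sketch. It is not the chapter's actual algorithm, which is not given on this page: the function names (`dct_sample_matrix`, `kmeans`, `compressive_kmeans`), the uniform random choice of DCT coefficients, and the plain Lloyd-style k-means loop are all assumptions made for illustration. The sketch projects each document vector onto a small random subset of Discrete Cosine Transform basis vectors and then runs k-means on the compressed vectors instead of the full ones.

```python
import numpy as np

def dct_sample_matrix(n_features, n_samples, rng):
    """Return n_samples randomly chosen rows of the orthonormal DCT-II matrix.

    Assumption: coefficients are chosen uniformly at random without
    replacement; the chapter may use a different sampling scheme.
    """
    rows = rng.choice(n_features, size=n_samples, replace=False)
    n = np.arange(n_features)
    basis = np.cos(np.pi * (2 * n[None, :] + 1) * rows[:, None] / (2 * n_features))
    basis *= np.sqrt(2.0 / n_features)
    basis[rows == 0] *= np.sqrt(0.5)  # scaling of the constant (k = 0) row
    return basis

def kmeans(X, k, rng, n_iter=50):
    """Plain Lloyd-style k-means; returns one cluster label per row of X."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # leave empty clusters where they are
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def compressive_kmeans(docs, k, n_samples, seed=0):
    """Cluster documents using only n_samples DCT coefficients per document."""
    rng = np.random.default_rng(seed)
    phi = dct_sample_matrix(docs.shape[1], n_samples, rng)
    compressed = docs @ phi.T  # shape: (n_docs, n_samples)
    return kmeans(compressed, k, rng)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    docs = rng.random((30, 64))  # stand-in for a document-term matrix
    print(compressive_kmeans(docs, k=3, n_samples=16))
```

Any saving in such a scheme comes from k-means computing distances over `n_samples` dimensions rather than over the full vocabulary; how much accuracy is retained depends on the data and on the number of samples kept.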

Author information

Authors and Affiliations

School of Computing and Mathematics, University of Western Sydney, Australia
Laurence A. F. Park

Editor information

Editors and Affiliations

Department of Informatics and Telecommunications, University of Athens, Panepistimioupolis, Ilisia, 15784, Athens, Greece
Dimitrios Gunopulos
Google Switzerland GmbH, Brandschenkestrasse 110, 8002, Zurich, Switzerland
Thomas Hofmann
Department of Computer Science, University of Bari “Aldo Moro”, via Orabona 4, 70125, Bari, Italy
Donato Malerba
Department of Informatics, Athens University of Economics and Business, Patision 76, 10434, Athens, Greece
Michalis Vazirgiannis

About this paper

Cite this paper

Park, L.A.F. (2011). Fast Approximate Text Document Clustering Using Compressive Sampling. In: Gunopulos, D., Hofmann, T., Malerba, D., Vazirgiannis, M. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2011. Lecture Notes in Computer Science, vol 6912. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23783-6_36

DOI: https://doi.org/10.1007/978-3-642-23783-6_36
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23782-9
Online ISBN: 978-3-642-23783-6
eBook Packages: Computer Science, Computer Science (R0)
