Improving Quality of Search Results Clustering with Approximate Matrix Factorisations

Osinski, Stanislaw

doi:10.1007/11735106_16

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3936)

Abstract

In this paper we show how approximate matrix factorisations can be used to organise document summaries returned by a search engine into meaningful thematic categories. We compare four different factorisations (SVD, NMF, LNMF and K-Means/Concept Decomposition) with respect to topic separation capability, outlier detection and label quality. We also compare our approach with two other clustering algorithms: Suffix Tree Clustering (STC) and Tolerance Rough Set Clustering (TRC). For our experiments we use the standard merge-then-cluster approach based on the Open Directory Project web catalogue as a source of human-clustered document summaries.
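The general idea of grouping snippets with a non-negative factorisation can be illustrated with a minimal sketch. It is not the algorithm evaluated in the paper: it assumes a raw term-frequency matrix with no stop-word removal or term weighting, uses the multiplicative updates of Lee and Seung for the Frobenius-norm objective, and assigns each snippet to the base vector with the largest coefficient. The function names, the toy snippets and the choice of k = 2 are hypothetical.

```python
import numpy as np

def term_document_matrix(snippets):
    """Build a raw term-frequency matrix A (terms x documents)."""
    vocab = sorted({w for s in snippets for w in s.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(snippets)))
    for j, s in enumerate(snippets):
        for w in s.lower().split():
            A[index[w], j] += 1.0
    return A, vocab

def nmf(A, k, iters=200, seed=0, eps=1e-9):
    """Approximate A ~ W @ H with non-negative W and H
    (multiplicative updates, Frobenius-norm objective)."""
    rng = np.random.default_rng(seed)
    W = rng.random((A.shape[0], k)) + eps
    H = rng.random((k, A.shape[1])) + eps
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy input standing in for search-result snippets (hypothetical data).
snippets = [
    "matrix factorisation singular value decomposition",
    "non-negative matrix factorisation algorithm",
    "jaguar car dealer price",
    "used car price dealer offers",
]

A, vocab = term_document_matrix(snippets)
W, H = nmf(A, k=2)

# Each column of H holds one snippet's coefficients over the k base vectors;
# the largest coefficient picks the cluster.
labels = H.argmax(axis=0)

# The highest-weighted terms of each base vector serve as a crude cluster description.
top_terms = [[vocab[i] for i in np.argsort(-W[:, c])[:3]] for c in range(2)]

print(labels)
print(top_terms)
```

Because the columns of W are non-negative, the highest-weighted terms of each base vector can be read directly as a rough description of the cluster, which is why this kind of factorisation lends itself to labelling as well as grouping.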

Author information

Authors and Affiliations

Poznan Supercomputing and Networking Center, ul. Noskowskiego 10, 61-704, Poznan, Poland
Stanislaw Osinski

Editor information

Editors and Affiliations

Queen Mary, University of London, London, UK
Mounia Lalmas
Department of Information Science, City University, Northampton Square, EC1V 0HB, London, UK
Andy MacFarlane
Knowledge Media Institute, The Open University, MK7 6AA, Milton Keynes, UK
Stefan Rüger
Queen Mary University of London, UK
Anastasios Tombros
CWI, Amsterdam, The Netherlands
Theodora Tsikrika
Department of Computing, Imperial College London, South Kensington Campus, SW7 2AZ, London, UK
Alexei Yavlinsky

About this paper

Cite this paper

Osinski, S. (2006). Improving Quality of Search Results Clustering with Approximate Matrix Factorisations. In: Lalmas, M., MacFarlane, A., Rüger, S., Tombros, A., Tsikrika, T., Yavlinsky, A. (eds) Advances in Information Retrieval. ECIR 2006. Lecture Notes in Computer Science, vol 3936. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11735106_16

DOI: https://doi.org/10.1007/11735106_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-33347-0
Online ISBN: 978-3-540-33348-7
eBook Packages: Computer Science, Computer Science (R0)
