
Using maximal spanning trees and word similarity to generate hierarchical clusters of non-redundant RSS news articles

Journal of Intelligent Information Systems

Abstract

RSS news articles that partially or completely duplicate one another's content are commonplace on the Internet these days, forcing Web users to sort through the articles to identify non-redundant information. This manual filtering process is time-consuming and tedious. In this paper, we present a new filtering and clustering approach, called FICUS, which first identifies and eliminates redundant RSS news articles using a fuzzy-set information retrieval approach and then clusters the remaining non-redundant RSS news articles according to their degrees of resemblance. FICUS organizes the clusters of RSS news articles in a tree hierarchy, and the content of each cluster is captured by representative keywords drawn from the RSS news articles it contains, so that searching for and retrieving similar RSS news articles is fast and efficient. FICUS is simple: it uses pre-defined word-correlation factors to determine related (words in) RSS news articles and to filter redundant ones, and it relies on well-known yet simple mathematical models, such as the standard deviation, the vector space model, and probability theory, to generate clusters of non-redundant RSS news articles. Experiments performed on (test sets of) RSS news articles on various topics, downloaded from different online sources, verify the accuracy of FICUS in eliminating redundant RSS news articles, clustering similar RSS news articles together, and segregating RSS news articles that differ in content. In addition, further empirical studies show that FICUS outperforms well-known approaches adopted for clustering RSS news articles.
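As a rough illustration of the fuzzy-set resemblance idea described above, the sketch below scores how well one article's words are covered by another's using pre-defined word-correlation factors. The function name, the averaging aggregation, and the correlation values are our own illustrative assumptions, not the paper's actual definitions.

```python
# Hypothetical sketch of fuzzy-set resemblance between two articles using
# pre-defined word-correlation factors. For each word of article A, take its
# best correlation with any word of article B (identical words correlate 1.0),
# then average over A's words.

def resemblance(words_a, words_b, corr):
    """Degree to which article A's content is covered by article B's."""
    if not words_a or not words_b:
        return 0.0
    best_matches = []
    for wa in words_a:
        best_matches.append(
            max(corr.get((wa, wb), 1.0 if wa == wb else 0.0) for wb in words_b)
        )
    return sum(best_matches) / len(best_matches)

# Illustrative correlation table and articles (reduced to word lists).
corr = {("car", "auto"): 0.8, ("auto", "car"): 0.8}
a = ["car", "crash"]
b = ["auto", "crash", "injury"]
score = resemblance(a, b, corr)  # car→auto 0.8, crash→crash 1.0, mean 0.9
print(round(score, 2))
```

A high score under a chosen threshold test would mark one article as redundant with respect to the other; the actual threshold and aggregation used by FICUS are defined in the full article.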


Figures 1–6 (available in the full-text article).


Notes

  1. Stopwords are commonly occurring words such as articles, prepositions, and conjunctions, which are poor discriminators of the content of a sentence (or RSS news article), whereas stemmed words are words reduced to their grammatical root. From now on, unless stated otherwise, whenever we refer to words, we mean non-stop, stemmed words.
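The preprocessing described in Note 1 can be sketched as follows. The stopword list and suffix-stripping rules below are deliberately minimal illustrations; a real system would use a full stoplist and a proper stemmer such as Porter's.

```python
# Minimal preprocessing sketch: drop stopwords, then reduce each remaining
# word to a crude stem by stripping a few common suffixes. Illustrative only.

STOPWORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "to", "is", "are"}

def stem(word):
    """Strip one common suffix, keeping at least a 3-letter root."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(sentence):
    """Return the non-stop, stemmed words of a sentence."""
    return [stem(w) for w in sentence.lower().split() if w not in STOPWORDS]

print(preprocess("The markets are crashing in Asia"))  # ['market', 'crash', 'asia']
```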

  2. From now on, whenever we use the term similar sentences (RSS news articles), we mean sentences (RSS news articles) that are semantically the same but different in terms of words used in the sentences (RSS news articles).

  3. The title of an RSS news article is also treated as a sentence when determining the degree of resemblance among RSS news articles.

  4. Standard deviation is a statistical measure that determines how spread out (i.e., the variability of) the values of a data set are.
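The measure in Note 4 is available directly in Python's standard library; the sample weights below are made up for illustration.

```python
# Population standard deviation of a small set of (illustrative) keyword
# weights: the square root of the mean squared deviation from the mean.
import statistics

weights = [0.2, 0.4, 0.4, 0.6]
mean = statistics.fmean(weights)   # 0.4
sd = statistics.pstdev(weights)    # sqrt(0.02) ≈ 0.1414
print(mean, round(sd, 4))
```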

  5. If there is more than one keyword in a cluster that has the highest weight among all the keywords in the cluster, we treat them all as “representative” keywords.
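Note 5's tie-handling rule can be sketched as below; the keyword weights are made up for illustration, and the function name is our own.

```python
# Every keyword whose weight equals the cluster maximum is treated as a
# "representative" keyword, so ties yield several representatives.

def representative_keywords(weights):
    """weights: dict mapping keyword -> weight within one cluster."""
    if not weights:
        return []
    top = max(weights.values())
    return sorted(k for k, w in weights.items() if w == top)

cluster = {"election": 0.9, "vote": 0.9, "poll": 0.7}
print(representative_keywords(cluster))  # ['election', 'vote']
```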

  6. To speed up the recursive partitioning process for finding the appropriate labels without sacrificing the quality, we remove stopwords, perform stemming on the words in the RSS news articles, and use the stems as labels.

  7. The sources of the RSS news articles in RSSrds include cnn.com, hosted.ap.org, news.yahoo.com, prnewswire.com, www.abcnews.go.com, www.cbsnews.com, and www.iht.com, to name a few.

  8. Only single-topic RSS news articles were used in this empirical study, which follows the evaluation premise detailed in Fung et al. (2003), Hu et al. (2008), and Xu and Gong (2004), in which multiple-topic RSS news articles are not considered.

  9. A natural class is an element of a set of classes K in which the labels of the documents within each class, i.e., terms or phrases used for describing the contents of the documents within a given class, such as “sports” or “entertainment”, are known in advance.

  10. Delicious (delicious.com) is a social bookmarking web service that was developed to aid users in storing, sharing, and discovering web bookmarks. Delicious uses a non-hierarchical classification system which allows its users to tag their bookmarks with freely chosen index terms.

References

  • Croft, B., Metzler, D., & Strohman, T. (2010). Search engines: information retrieval in practice. Addison Wesley.

  • Dhillon, I., & Modha, D. (2001). Concept decompositions for large sparse text data using clustering. Machine Learning, 42, 143–175.

  • Dhillon, I., Mallela, S., & Kumar, R. (2002). Enhanced word clustering for hierarchical text classification. In Proceedings of ACM international conference on knowledge discovery and data mining (SIGKDD) (pp. 191–200).

  • Fung, B., Wang, K., & Ester, M. (2003). Hierarchical document clustering using frequent itemsets. In Proceedings of SIAM international conference on data mining (ICDM) (pp. 59–70).

  • Hammouda, K., & Kamel, M. (2002). Phrase-based document similarity based on an index graph model. In Proceedings of IEEE international conference on data mining (ICDM) (pp. 203–210).

  • Hu, J., Fang, L., Cao, Y., Zeng, H., Li, H., Yang, Q., et al. (2008). Enhancing text clustering by leveraging wikipedia semantics. In Proceedings of ACM Conference on Research and Development in Information Retrieval (SIGIR) (pp. 179–186).

  • Jain, A., Murty, M., & Flynn, P. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264–323.

  • Koberstein, J., & Ng, Y.-K. (2006). Using word clusters to detect similar RSS news articles. In Proceedings of the international conference on knowledge science, engineering and management (KSEM) (pp. 215–228). LNAI 4092.

  • Larsen, B., & Aone, C. (1999). Fast and effective text mining using linear-time document clustering. In Proceedings of ACM international conference on knowledge discovery and data mining (SIGKDD) (pp. 16–22).

  • Li, X., Zaiane, O., & Li. Z. (2006). A comparative study on text clustering methods. In Proceedings of advanced data mining and applications (pp. 644–651).

  • Lim, S., & Ng, Y.-K. (2005). Categorization and information extraction of multilingual HTML documents. In Proceedings of the 9th international database engineering and application symposium (IDEAS) (pp. 415–422).

  • Liu, X., Gong, Y., Xu, W., & Zhu, S. (2002). Document clustering with cluster refinement and model selection capabilities. In Proceedings of ACM conference on research and development in information retrieval (SIGIR) (pp. 191–198).

  • Mitchell, T. (1997). Machine learning. McGraw Hill.

  • Ogawa, Y., Morita, T., & Kobayashi, K. (1991). A fuzzy document retrieval system using the keyword connection matrix and a learning method. Fuzzy Sets and Systems, 39, 163–179.

  • Pera, M., & Ng, Y.-K. (2009). Synthesizing correlated RSS news articles based on a fuzzy equivalence relation. International Journal of Web Information Systems (IJWIS), 5(1), 77–109.

  • Shafer, G. (1976). A mathematical theory of evidence. Princeton University Press.

  • Slonim, N., Friedman, N., & Tishby, N. (2002) Unsupervised document classification using sequential information maximization. In Proceedings of ACM conference on research and development in information retrieval (SIGIR) (pp. 129–136).

  • Slonim, N., & Tishby, N. (2001). The power of word clusters for text classification. In Proceedings of the 23rd European colloquium on information retrieval research (ECIR) (pp. 191–200).

  • Xu, W., & Gong, Y. (2004). Document clustering by concept factorization. In Proceedings of ACM conference on research and development in information retrieval (SIGIR) (pp. 202–209).

  • Xu, W., Liu, X., & Gong, Y. (2003). News article clustering based on non-negative matrix factorization. In Proceedings of ACM conference on research and development in information retrieval (SIGIR) (pp. 267–273).

  • Zheng, X., He, P., Tian, M., & Yuan, F. (2003). Algorithm of documents clustering based on minimum spanning tree. In Proceedings of the 2nd international conference on machine learning and cybernetics (pp. 199–203).

  • Zhong, S., & Ghosh, J. (2005). A comparative study of generative models for document clustering. Knowledge and Information Systems, 8(3), 374–384.

  • Zwillinger, D., Krantz, S., & Rosen, K. (Eds.) (1996). Standard mathematical tables and formulae (30th edition). CRC Press.

Acknowledgements

The authors would like to thank the anonymous reviewers for their constructive comments and suggestions, which guided us to improve the quality of the article.

Author information

Correspondence to Yiu-Kai Dennis Ng.

About this article

Cite this article

Pera, M.S., Ng, YK.D. Using maximal spanning trees and word similarity to generate hierarchical clusters of non-redundant RSS news articles. J Intell Inf Syst 39, 513–534 (2012). https://doi.org/10.1007/s10844-012-0201-z
