New Approach for Automated Categorizing and Finding Similarities in Online Persian News

  • Naser Ezzati Jivan
  • Mahlagha Fazeli
  • Khadije Sadat Yousefi
Part of the Communications in Computer and Information Science book series (CCIS, volume 96)


The Web is a rich source of information in which data are stored in different formats, e.g., web pages, archive files and images. Algorithms and tools that automatically categorize web pages have wide application in real-life situations; a web site that collects news from different sources is one example. In this paper, an algorithm for categorizing news is proposed. The proposed approach is specialized to work with documents (news items) written in the Persian language, but it can easily be generalized to documents in other languages. There is no standard test bench or measure for evaluating the performance of this kind of algorithm, because the degree of similarity between two documents (news items) is not well defined. To test the performance of the proposed algorithm, we implemented a web site that uses the proposed approach to find similar news. Some of the similar news items found by the algorithm are reported.


Keywords: categorization of web pages; category; automatic categorization of Persian news; feature; similarity; clustering; structure of web pages





Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Naser Ezzati Jivan (1)
  • Mahlagha Fazeli (2)
  • Khadije Sadat Yousefi (2)

  1. National Library and Archives of the I.R. of Iran, Iran
  2. Iranian University of Science and Technology, Iran
