Classification of Web Documents using Fuzzy Logic Categorical Data Clustering

  • George E. Tsekouras
  • Christos Anagnostopoulos
  • Damianos Gavalas
  • Economou Dafhi
Part of the IFIP International Federation for Information Processing book series (IFIPAICT, volume 247)


We propose a fuzzy clustering algorithm for categorical data and use it to classify web documents. We extract a number of words for each thematic area (category) and treat each word as a multidimensional categorical data vector. For each category, the algorithm partitions the available words into a number of clusters, where the center of each cluster corresponds to a word. The dissimilarity between two words is measured with the Hamming distance. A new document is then classified in two steps. First, we compute the minimum distance between the document and the cluster centers of each category. Second, we select the smallest of these minimum distances and assign the document to the corresponding category.
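The two-step decision rule described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how a document is compared with a cluster center, so the sketch assumes the document has already been encoded as a categorical vector of the same length as the centers. The function names `hamming` and `classify`, the category labels, and the toy vectors are all hypothetical, and the fuzzy clustering step that would produce the centers is not shown.

```python
from typing import Dict, List, Sequence, Tuple


def hamming(a: Sequence[str], b: Sequence[str]) -> int:
    """Count the positions at which two equal-length categorical vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


def classify(
    doc: Sequence[str],
    centers: Dict[str, List[Sequence[str]]],
) -> Tuple[str, int]:
    """Return (category, distance) using the two-step minimum-distance rule."""
    # Step 1: for each category, the minimum distance from the document
    # to any of that category's cluster centers.
    min_per_category = {
        category: min(hamming(doc, center) for center in category_centers)
        for category, category_centers in centers.items()
    }
    # Step 2: the category whose minimum distance is the smallest overall.
    best = min(min_per_category, key=min_per_category.get)
    return best, min_per_category[best]


# Hypothetical cluster centers (assumed output of the clustering step).
centers = {
    "sports": [("a", "b", "c", "d"), ("a", "x", "c", "y")],
    "finance": [("p", "q", "r", "s"), ("p", "b", "r", "d")],
}

# Hypothetical document, assumed to be encoded like the centers.
doc = ("a", "b", "c", "y")

category, distance = classify(doc, centers)
print(category, distance)
```

In this toy case the document differs from each "sports" center in one position and from the closest "finance" center in three, so it is assigned to "sports". Ties between categories are resolved here by dictionary order, which is a choice of this sketch and not something the abstract specifies.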


Cluster Center Categorical Object Inverse Document Frequency Index Cluster Validity Statistical Natural Language Processing 



Copyright information

© International Federation for Information Processing 2007

Authors and Affiliations

  • George E. Tsekouras (1)
  • Christos Anagnostopoulos (1)
  • Damianos Gavalas (1)
  • Economou Dafhi (1)
  1. Department of Cultural Technology and Communication, Laboratory of Intelligent Multimedia, University of the Aegean, Mytilene, Lesvos Island, Greece
