An Analysis of Constructed Categories for Textual Classification Using Fuzzy Similarity and Agglomerative Hierarchical Methods

  • Marcus V. C. Guelpeli
  • Ana Cristina Bicharra Garcia
  • Flavia Cristina Bernardini
Part of the Advanced Information and Knowledge Processing book series (AI&KP)


Ambiguity is a challenge for systems that handle natural language. To mitigate the linguistic ambiguities found in text classification, this work proposes a text categorizer based on Fuzzy Similarity. The Stars and Cliques clustering algorithms are adopted within the Agglomerative Hierarchical method; they identify groups of texts by applying a relationship rule, creating categories from a similarity analysis of the textual terms. Under the proposed methodology, categories are created from the degree of similarity among the texts to be classified, without the number of categories having to be fixed in advance. The combination of techniques used in the categorizer's steps produced satisfactory results and proved efficient for textual classification.
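The idea of threshold-based grouping can be illustrated with a minimal sketch. It is not the authors' implementation: the similarity measure is a fuzzy Jaccard ratio chosen here as a stand-in for the chapter's Fuzzy Similarity, and the names `term_weights`, `fuzzy_similarity`, `stars`, `cliques` and the `threshold` parameter are all hypothetical. The sketch only shows how Stars-style and Cliques-style rules can form categories from pairwise similarity without a preset number of groups.

```python
from collections import Counter


def term_weights(text):
    """Relative term frequencies in [0, 1], used as fuzzy membership degrees.

    Assumption: whitespace tokenisation, no stop-word removal or stemming.
    """
    counts = Counter(text.lower().split())
    top = max(counts.values())
    return {term: n / top for term, n in counts.items()}


def fuzzy_similarity(a, b):
    """Fuzzy Jaccard ratio (an assumed stand-in for the chapter's measure):
    sum of minimum memberships divided by sum of maximum memberships."""
    terms = set(a) | set(b)
    inter = sum(min(a.get(t, 0.0), b.get(t, 0.0)) for t in terms)
    union = sum(max(a.get(t, 0.0), b.get(t, 0.0)) for t in terms)
    return inter / union if union else 0.0


def stars(docs, threshold):
    """Stars-style rule: an unassigned text becomes a centre and attracts
    every other unassigned text whose similarity to it reaches the threshold."""
    weights = [term_weights(d) for d in docs]
    unassigned = list(range(len(docs)))
    clusters = []
    while unassigned:
        centre = unassigned.pop(0)
        members = [centre] + [
            i for i in unassigned
            if fuzzy_similarity(weights[centre], weights[i]) >= threshold
        ]
        unassigned = [i for i in unassigned if i not in members]
        clusters.append(members)
    return clusters


def cliques(docs, threshold):
    """Cliques-style rule: a text joins a group only if it is similar
    enough to every text already in that group."""
    weights = [term_weights(d) for d in docs]
    clusters = []
    for i in range(len(docs)):
        for group in clusters:
            if all(fuzzy_similarity(weights[i], weights[j]) >= threshold
                   for j in group):
                group.append(i)
                break
        else:
            clusters.append([i])  # no compatible group: start a new category
    return clusters


if __name__ == "__main__":
    docs = [
        "fuzzy logic text mining",
        "text mining fuzzy similarity",
        "sugarcane genetic diversity",
        "genetic diversity sugarcane clones",
    ]
    print(stars(docs, 0.5))
    print(cliques(docs, 0.5))
```

In this sketch the threshold alone decides how many categories emerge: raising it splits groups, lowering it merges them. The Cliques-style rule is the stricter of the two, since it checks every member of a group rather than only its centre.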


Keywords: Fuzzy Logic, Text Mining, Stop Criterion, Source Text, Hierarchical Method



Copyright information

© Springer-Verlag London 2010

Authors and Affiliations

  • Marcus V. C. Guelpeli (1), corresponding author
  • Ana Cristina Bicharra Garcia (1)
  • Flavia Cristina Bernardini (2)
  1. Departamento de Ciência da Computação, Instituto de Computação (IC), Universidade Federal Fluminense (UFF), São Domingos, Brazil
  2. Departamento de Ciência e Tecnologia (RCT), Pólo Universitário de Rio das Ostras (PURO), Universidade Federal Fluminense (UFF), Rio das Ostras, Brazil
