An Approach to Clustering Abstracts

  • Mikhail Alexandrov
  • Alexander Gelbukh
  • Paolo Rosso
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3513)


In major digital libraries and other web repositories, free access to scientific papers is limited to their abstracts, which consist of no more than several dozen words. Current keyword-based techniques can cluster such short texts only when the data set is multi-category, e.g., some documents are devoted to sport, others to medicine, others to politics. However, they fail on narrow domain-oriented libraries, e.g., those containing documents only on physics, only on geology, or only on computational linguistics. Yet precisely such data sets are the most frequent and the most interesting ones. We propose a simple procedure for clustering abstracts, which consists of grouping keywords and using a more adequate document similarity measure. We use Stein’s MajorClust method for clustering both keywords and documents. We illustrate our approach on texts from the proceedings of a narrow-topic conference. Limitations of our approach are also discussed. Our preliminary experiments show that abstracts cannot be clustered with the same quality as full texts, though the achieved quality is adequate for many applications; accordingly, we support Makagonov’s proposal that digital libraries should provide document images of the full texts of papers (and not only abstracts) for open access via the Internet, to help in searching, classifying, clustering, selecting, and properly referencing papers.
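
The abstract names MajorClust as the clustering method but does not describe it. As a rough illustration only, the Python sketch below follows the commonly described idea behind MajorClust: every node starts in its own cluster and repeatedly adopts the label of the neighboring cluster to which it has the largest total edge weight, until no label changes. The function name `majorclust`, the graph format, the `max_iter` cap, the tie-breaking rule, and the toy similarity values are all assumptions made for this example, not details taken from the paper.

```python
from collections import defaultdict


def majorclust(graph, max_iter=100):
    """Label each node with the cluster that attracts it most strongly.

    graph: dict mapping node -> {neighbor: similarity}, assumed symmetric.
    Returns a dict mapping node -> cluster label.
    """
    # Every node starts as its own cluster.
    labels = {node: node for node in graph}
    for _ in range(max_iter):
        changed = False
        for node in sorted(graph):
            # Sum the edge weights leading into each neighboring cluster.
            attraction = defaultdict(float)
            for neighbor, weight in graph[node].items():
                attraction[labels[neighbor]] += weight
            if not attraction:
                continue  # an isolated node keeps its own label
            # Pick the strongest cluster; on ties prefer the current label
            # (an assumed tie-breaking rule that helps the loop settle).
            best = max(
                attraction,
                key=lambda lab: (attraction[lab], lab == labels[node]),
            )
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels


# Toy similarity graph (hypothetical values): two tight groups of
# items joined by a single weak link between "c" and "d".
edges = [
    ("a", "b", 0.9), ("a", "c", 0.9), ("b", "c", 0.9),
    ("c", "d", 0.1),
    ("d", "e", 0.9), ("d", "f", 0.9), ("e", "f", 0.9),
]
graph = defaultdict(dict)
for u, v, w in edges:
    graph[u][v] = w
    graph[v][u] = w

labels = majorclust(graph)
print(labels)
```

In this sketch the nodes could stand for either keywords or documents, since the abstract applies the same method to both; only the similarity values stored on the edges would differ.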


Digital Library; Near Neighbor; Document Collection; Document Image; Short Document
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Mikhail Alexandrov (1, 2)
  • Alexander Gelbukh (1)
  • Paolo Rosso (2)

  1. Center for Computing Research, National Polytechnic Institute, Mexico
  2. Polytechnic University of Valencia, Spain
