Query Refinement Through Lexical Clustering of Scientific Textual Databases

  • Eric SanJuan
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3513)


The TermWatch system automatically extracts multi-word terms from scientific texts using morphological analysis and relates them through linguistic variations. The resulting terminological network is clustered with a 3-level hierarchical graph algorithm and mapped onto a 2D space. Clusters are automatically labeled based on variation activity. After a detailed review of the methodology, this paper evaluates, in the context of querying a scientific textual database, the overlap of terms and cluster labels with the keywords selected by human indexers, as well as the set of possible queries based on the clustering output. The results show that the linguistic variation paradigm is a robust way of automatically extracting and structuring a user-comprehensible terminological resource for query refinement.
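To illustrate the kind of pipeline summarized above, here is a minimal Python sketch under strong assumptions. It is not the TermWatch algorithm. Connected components of the variation graph stand in for the 3-level hierarchical graph algorithm, the labeling rule (the term with the most variation links) is only one possible reading of "variation activity", and precision and recall are one common way to measure overlap with indexer keywords. All function names and the sample terms are hypothetical.

```python
from collections import defaultdict


def cluster_terms(variation_links):
    """Group terms into connected components of the variation graph.

    Assumption: a plain connected-components pass replaces the paper's
    3-level hierarchical graph algorithm, which the abstract does not detail.
    """
    graph = defaultdict(set)
    for a, b in variation_links:
        graph[a].add(b)
        graph[b].add(a)

    seen, clusters = set(), []
    for start in sorted(graph):  # sorted for a deterministic cluster order
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            term = stack.pop()
            if term in component:
                continue
            component.add(term)
            stack.extend(graph[term] - component)
        seen |= component
        clusters.append(component)
    return graph, clusters


def label_cluster(graph, cluster):
    """Hypothetical labeling rule: pick the term with the most variation
    links, breaking ties alphabetically."""
    return min(cluster, key=lambda term: (-len(graph[term]), term))


def keyword_overlap(candidates, indexer_keywords):
    """Return (precision, recall) of candidates against indexer keywords."""
    candidates = {c.lower() for c in candidates}
    keywords = {k.lower() for k in indexer_keywords}
    shared = candidates & keywords
    precision = len(shared) / len(candidates) if candidates else 0.0
    recall = len(shared) / len(keywords) if keywords else 0.0
    return precision, recall


# Made-up toy data: pairs of terms linked by a linguistic variation.
links = [
    ("query refinement", "automatic query refinement"),
    ("query refinement", "query refinement method"),
    ("term extraction", "multi-word term extraction"),
]
graph, clusters = cluster_terms(links)
labels = [label_cluster(graph, c) for c in clusters]
precision, recall = keyword_overlap(labels, ["query refinement", "clustering"])
```

In this toy setting, each cluster label can serve as a candidate query refinement, and the precision and recall values indicate how closely the labels match what a human indexer chose.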


Latent Semantic Analysis; Cluster Label; Probabilistic Latent Semantic Analysis; Single Link Cluster; Cluster Output
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Eric SanJuan (1)
  1. LITA, Université Paul Verlaine & URI-INIST/CNRS, Metz, France
