Discovering a Term Taxonomy from Term Similarities Using Principal Component Analysis

  • Holger Bast
  • Georges Dupret
  • Debapriyo Majumdar
  • Benjamin Piwowarski
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4289)

Abstract

We show that eigenvector decomposition can be used to extract a term taxonomy from a given collection of text documents. So far, methods based on eigenvector decomposition, such as latent semantic indexing (LSI) or principal component analysis (PCA), were known to be useful only for extracting symmetric relations between terms. We give a precise mathematical criterion for distinguishing between four kinds of relations for a given pair of terms in a given collection: unrelated (car – fruit), symmetrically related (car – automobile), asymmetrically related with the first term being more specific than the second (banana – fruit), and asymmetrically related in the other direction (fruit – banana). We give theoretical evidence for the soundness of our criterion by showing that, in a simplified mathematical model, it classifies term pairs as one would intuitively expect. We applied our scheme to reconstruct a selected part of the Open Directory Project (ODP) hierarchy, with promising results.
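
To illustrate why eigenvector-based methods were associated with symmetric relations, here is a minimal sketch of an LSI-style term similarity. It is not the criterion proposed in the paper, which the abstract does not spell out. The toy term-document matrix, the function name `term_similarity`, and the rank `k` are all assumptions made for illustration.

```python
import numpy as np

# Toy term-document count matrix (rows: terms, columns: documents).
# The values are invented for illustration only.
terms = ["car", "automobile", "banana", "fruit"]
A = np.array([
    [2, 1, 0, 0, 0],  # car
    [1, 2, 0, 0, 0],  # automobile
    [0, 0, 2, 1, 0],  # banana
    [0, 0, 1, 2, 2],  # fruit
], dtype=float)


def term_similarity(A, k):
    """Project the terms onto the top-k eigenvectors of A A^T (hypothetical helper)."""
    C = A @ A.T                           # symmetric term-term matrix
    eigvals, eigvecs = np.linalg.eigh(C)  # eigh: eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]   # indices of the k largest eigenvalues
    U = eigvecs[:, top]
    return U @ U.T                        # symmetric by construction


S = term_similarity(A, k=2)
for i, t in enumerate(terms):
    print(t, np.round(S[i], 3))
```

Because `S` equals `U @ U.T`, the entry for (banana, fruit) is always the same as the entry for (fruit, banana). A single similarity matrix of this kind therefore cannot say which of two related terms is the more specific one, which is the gap the paper's criterion is meant to address.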

Keywords

Taxonomy Extraction, Ontology Extraction, Semantic Tagging, Latent Semantic Indexing, Principal Component Analysis, Eigenvector Decomposition

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Holger Bast (1)
  • Georges Dupret (2)
  • Debapriyo Majumdar (1)
  • Benjamin Piwowarski (2)
  1. Max-Planck-Institut für Informatik, Saarbrücken
  2. Yahoo! Research Latin America