Selective Compound Splitting of Swedish Queries for Boolean Combinations of Truncated Terms

  • Rickard Cöster
  • Magnus Sahlgren
  • Jussi Karlgren
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3237)

Abstract

In languages that use compound words, such as Swedish, it is often necessary to split compound words when indexing documents or queries. One of the problems is that it is difficult to find constituents that express a concept similar to that expressed by the compound. The approach taken here is to expand a query with the leading constituents of the compound words. Every query term is truncated in order to increase recall, in the hope of finding other compounds that have the leading constituent as a prefix. Because this increases recall in a rather uncontrolled way, we use a Boolean quorum-level search method to rank documents according to both a tf-idf factor and the number of matching Boolean combinations.
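
As a rough illustration of the idea described above, the following Python sketch expands a query with leading constituents, treats every query term as a prefix (truncation), and ranks documents first by how many truncated terms they match and then by a summed tf-idf weight. It is not the authors' implementation: the function names, the `splits` lookup table, the toy documents, the idf formula, and the way the two ranking criteria are combined are all assumptions made for the example. Counting matched terms is used here as a stand-in for counting matching Boolean combinations, since a document that matches more terms also satisfies more conjunctions of those terms.

```python
import math

def expand_query(terms, leading_constituent):
    """Append the leading constituent of each compound term, if one is known."""
    expanded = []
    for term in terms:
        for candidate in (term, leading_constituent.get(term)):
            if candidate and candidate not in expanded:
                expanded.append(candidate)
    return expanded

def quorum_rank(prefixes, docs):
    """Rank docs by number of matched truncated terms, then by a tf-idf sum."""
    n_docs = len(docs)
    # Document frequency of each truncated term (prefix match on tokens).
    df = {p: sum(any(t.startswith(p) for t in toks) for toks in docs.values())
          for p in prefixes}
    results = []
    for doc_id, toks in docs.items():
        matched, score = 0, 0.0
        for p in prefixes:
            tf = sum(t.startswith(p) for t in toks)  # truncated-term frequency
            if tf:
                matched += 1
                score += tf * math.log(1 + n_docs / df[p])  # assumed idf variant
        if matched:
            results.append((doc_id, matched, score))
    # Higher quorum level first, then higher tf-idf, then doc id for stability.
    results.sort(key=lambda r: (-r[1], -r[2], r[0]))
    return results

# Hypothetical compound-splitting table and toy collection.
splits = {"fotbollsmatch": "fotboll"}
docs = {
    "d1": ["fotbollsmatch", "i", "stockholm"],
    "d2": ["fotbollsspelare", "skadad"],
    "d3": ["väder", "i", "stockholm"],
}

query = expand_query(["fotbollsmatch", "stockholm"], splits)
ranking = quorum_rank(query, docs)
for doc_id, matched, score in ranking:
    print(doc_id, matched, round(score, 3))
```

In this toy collection, d1 matches all three truncated terms and is ranked first. d2 is retrieved only because the added prefix "fotboll" matches another compound, which shows the recall gain, and its single match keeps it below d1, which shows how the quorum level controls that gain.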

The Boolean combinations performed relatively well, considering that the queries were very short (a maximum of five search terms). Also included in this paper are the results of two other methods we are currently working on in our lab: one for re-ranking search results on the basis of stylistic analysis of documents, and one for dimensionality reduction using Random Indexing.

Keywords

Dimensionality Reduction, Query Term, Query Expansion, Vector Space Model, Compound Word
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Rickard Cöster (1)
  • Magnus Sahlgren (1)
  • Jussi Karlgren (1)
  1. Swedish Institute of Computer Science (SICS), Kista, Sweden
