
Within-Language Information Retrieval


Abstract

The information retrieval system stands at the core of many information acquisition cycles. Its task is the retrieval of relevant information from document collections in response to a coded query based on an information need. In its general form, when searching unstructured, natural language text produced by a large range of authors, this is a difficult task: in such text there are many different valid ways to convey the same information. Adding to the complexity of the task is an often incomplete understanding of the desired information by the user. In this chapter, we discuss the mechanisms employed for matching queries and (textual) documents within one language, covering some of the peculiarities of a number of widely spoken languages. Effective within-language retrieval is an essential prerequisite for effective multilingual information access. The discussion of within-language information retrieval or monolingual information retrieval can be structured into two main phases: the indexing phase, commonly implemented as a pipeline of indexing steps, producing a representation that is suitable for matching; and the matching phase, which operates on the indexed representations and produces a ranked list of documents that are most likely to satisfy the user’s underlying information need.
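
To make these two phases concrete, here is a minimal sketch (a toy illustration in Python with invented documents, not the implementation discussed in this chapter): an indexing function that reduces each document to a bag of tokens, and a matching function that ranks documents by simple term overlap with the query.

```python
# Toy two-phase retrieval sketch: indexing produces bag-of-token
# representations, matching ranks documents by query-term overlap.
# All names and documents are invented for illustration.

def build_index(documents):
    """Indexing phase: map each document id to its bag of tokens."""
    return {doc_id: set(text.lower().split()) for doc_id, text in documents.items()}

def rank(query, index):
    """Matching phase: score documents by overlap with the query tokens."""
    query_tokens = set(query.lower().split())
    scores = {doc_id: len(query_tokens & tokens) for doc_id, tokens in index.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

documents = {
    1: "Information retrieval from unstructured text collections",
    2: "Database management systems store structured data",
}
print(rank("text retrieval", build_index(documents)))  # [(1, 2), (2, 0)]
```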


Notes

  1. This is what distinguishes information retrieval systems from Database Management Systems (DBMS), where correctly structured data is stored according to a well-defined database schema and where users select subsets of this data according to exactly specified criteria.

  2. This issue is especially important for comparatively ‘smaller’ document collections. The World Wide Web is a special case in that its massive size and the high redundancy of its content help with locating some relevant documents for nearly all queries. In this sense, searching smaller collections is often harder: the user’s task of picking good search terms is considerably more complex, and the IR system should thus be able to match a wider range of possible query formulations.

  3. One likely reason for the use of very short queries is the prevalent use of web search services such as Google and Yahoo, which by default narrow search results by applying an implicit ‘AND’ operator to the query terms, as illustrated below.
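
A hypothetical illustration of such an implicit ‘AND’ (with an invented term-to-document mapping): a document is returned only if it contains every query term, which quickly narrows the result set as terms are added.

```python
# Invented postings: term -> set of document ids containing the term.
postings = {
    "solar":  {1, 4, 7},
    "energy": {4, 7, 9},
    "policy": {2, 4},
}

query_terms = ["solar", "energy", "policy"]
# Implicit AND: intersect the postings of all query terms.
matches = set.intersection(*(postings.get(t, set()) for t in query_terms))
print(matches)  # {4} - only documents containing every term are returned
```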

  4. We discuss the concept of relevance and its implications in more detail in Chapter 5.

  5. One of the early definitions of the ‘document retrieval problem’ is given by Robertson et al. (1982): “… the function of a document retrieval system is to retrieve all and only those documents that the inquiring patron wants (or would want).”

  6. As could be done with a database query.

  7. Defined as: “If a reference retrieval system’s response to each request is a ranking of the documents in the collections in order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data.” (Robertson 1977).

  8. This is essentially an arbitrary data structure, built for efficient access to retrievable items. Details on the implementation of an index for an IR system are given in Section 2.5.2.

  9. This is in contrast to databases, where every effort is made to store data in a normalised form.

  10. Unicode is the current de facto standard for character encoding of written text in applications that need to cover many diverse languages (www.unicode.org). XML is a text format intended for the creation of structured (text) documents that are machine-readable (http://www.w3.org/XML/).

  11. For more details, see e.g. Erickson (1997).

  12. The checksum is typically a numerical value that acts as a ‘summary’ of the underlying data and that is sensitive to (slight) changes therein. While two random pieces of text may share the same checksum, it is highly improbable that a text would still generate the same checksum after modification when an appropriate algorithm is used for the checksum calculation. Exact duplicates of documents will share the same checksum; a small illustration follows below.
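
A small example of checksum-based detection of exact duplicates, here using a cryptographic hash from Python’s standard library as one possible (assumed) choice of checksum algorithm:

```python
import hashlib

def checksum(text: str) -> str:
    """Checksum acting as a compact 'summary' of the text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

a = "The quick brown fox jumps over the lazy dog."
b = "The quick brown fox jumps over the lazy dog."   # exact duplicate of a
c = "The quick brown fox jumps over the lazy dog!"   # slightly modified

print(checksum(a) == checksum(b))  # True: exact duplicates share the checksum
print(checksum(a) == checksum(c))  # False: even a small edit changes it
```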

  13. Conceptually, every document needs to be compared to all other documents in the collection.

  14. A short, usually fixed-length, representation of the document is used that, unlike checksums, allows for similarity comparisons; one possible construction is sketched below.
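
One way to obtain such a representation, in the spirit of Broder (2000), is to hash overlapping character shingles and keep only the smallest hash values as a fingerprint. The sketch below is a simplified, assumed variant, not the exact scheme from the literature.

```python
def fingerprint(text, shingle_len=5, k=50):
    """Keep the k smallest hashes of overlapping character shingles."""
    shingles = {text[i:i + shingle_len] for i in range(len(text) - shingle_len + 1)}
    return frozenset(sorted(hash(s) for s in shingles)[:k])

def similarity(fp_a, fp_b):
    """Jaccard overlap of two fingerprints, approximating text similarity."""
    return len(fp_a & fp_b) / max(len(fp_a | fp_b), 1)

doc1 = "information retrieval from unstructured text collections"
doc2 = "information retrieval from very large text collections"
# Noticeably higher for near-duplicate texts than for unrelated ones.
print(similarity(fingerprint(doc1), fingerprint(doc2)))
```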

  15. That is, the text underlying a hyperlink that is clicked by the user to follow the link.

  16. While structurally the same as the character n-grams used for Latin-script text in Step 5, they are named differently to account for the different role of Chinese characters; a small example follows below.
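
A small example of such overlapping character bigram indexing for a Chinese string (the example string is mine, chosen purely for illustration):

```python
def char_bigrams(text):
    """Overlapping character bigrams, used where word boundaries are not marked."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

print(char_bigrams("信息检索系统"))
# ['信息', '息检', '检索', '索系', '系统']
```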

  17. This argument is obviously not without problems, as very frequent words such as ‘not’ can easily invert the meaning of a sentence. However, the weighting approaches discussed in Section 2.5 are based mainly on frequency counts and do not employ deeper semantic analysis of the text.

  18. Adaptations of the Porter stemmer to many different languages are available today.

  19. In particular, there is no attempt to detect names and foreign words, which are thus potentially stemmed as well should they match the rules; see the sketch below.
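
An illustrative use of the Porter stemmer and one of its language adaptations, here via the NLTK library (an assumed choice of toolkit, not one prescribed by this chapter); note how a token that happens to be a proper name is stemmed like any other word:

```python
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()
# The name 'Gates' matches the suffix rules and is stemmed as well.
print([porter.stem(w) for w in ["connection", "connected", "gates"]])
# ['connect', 'connect', 'gate']

# One of the many available adaptations: a German stemmer.
german = SnowballStemmer("german")
print(german.stem("häuser"))  # 'haus'
```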

  20. Not to be confused with the word n-grams used to solve the segmentation problem in East Asian languages, see Step 4.

  21. Dutch, English, Finnish, French, German, Italian, Spanish, and Swedish.

  22. While ‘bag of words’ is the customary term for describing the approach, ‘bag of tokens’ would be the more helpful choice, as it indicates the nature of the units contained in the indexed representation.

  23. These implications are tied to the usual assumption that all indexing features that have corresponding tokens in the bag are independent of each other – an assumption that is clearly violated in the case of phrasal expressions.

  24. The freely available and often used SMART stopword list is rather exhaustive, containing entries such as ‘need’. It thus nicely illustrates the issues with stopword elimination discussed in Section 2.4.4 by suppressing the final word of the abstract in the indexed representation; see the example below.
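
A toy example of stopword elimination (with a tiny invented stopword list rather than the SMART list itself), showing how a potentially meaningful token such as ‘need’ disappears from the indexed representation:

```python
# Invented miniature stopword list that happens to contain 'need'.
stopwords = {"the", "a", "of", "to", "is", "need"}

tokens = ["the", "user's", "underlying", "information", "need"]
indexed = [t for t in tokens if t not in stopwords]
print(indexed)  # ["user's", 'underlying', 'information'] - 'need' is suppressed
```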

  25. Whether the text sample is actually relevant to the user’s query is a different question, one that is only answerable if the underlying information need, user preferences, etc. are known.

  26. A data structure that allows items to be located at a ‘cost’ that is typically independent of the size of the structure (Sedgewick and Wayne 2011); a minimal sketch follows below.
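
A minimal sketch, using a Python dict (a hash table) as the structure in question: each indexing feature maps to its postings, so a query term can be looked up in expected constant time regardless of vocabulary size. The postings shown are invented.

```python
# term -> list of (document id, term frequency) pairs; contents are invented
inverted_index = {
    "retrieval": [(1, 3), (4, 1)],
    "language":  [(2, 2), (4, 2)],
}

postings = inverted_index.get("retrieval", [])  # expected O(1) lookup
print(postings)  # [(1, 3), (4, 1)]
```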

  27. Derived from m original search terms, where potentially m ≠ n.

  28. Also typically implemented in the form of a hash table.

  29. A wildcard character is a placeholder for one or more characters that are unknown or unspecified; see the illustration below.
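
A hypothetical illustration: the wildcard query term ‘retriev*’ is expanded against the indexing vocabulary before matching, here using pattern matching from Python’s standard library.

```python
from fnmatch import fnmatch

vocabulary = ["retrieval", "retrieve", "retriever", "relevance"]
expanded = [term for term in vocabulary if fnmatch(term, "retriev*")]
print(expanded)  # ['retrieval', 'retrieve', 'retriever']
```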

  30. Complexities arise when one considers the possibility of queries containing features that do not occur in any document of the collection. Theoretically, the vector space is extended by those features (i.e., the vector space represents an ‘indexing vocabulary’ that extends beyond only those features present in the collection), but this case can be ignored for the discussion of the basic workings of the model as presented here.

  31. The use of ‘term frequency’ actually gives rise to the common name for this weighting scheme, tf.idf-Cosine, although ff.idf-Cosine (‘feature frequency’) would be equally justified, especially when considering non-textual information.

  32. The angle will be small independently of the length of the document. Since queries are typically much shorter than documents, this is a desirable property; a worked sketch of tf.idf weighting and cosine matching follows below.
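
A compact sketch of tf.idf weighting combined with cosine matching, under the simplest possible choices (raw term frequency, logarithmic idf, no smoothing); real systems use many variants of these formulas, and the toy documents are invented.

```python
import math
from collections import Counter

docs = {
    1: "cross language information retrieval",
    2: "information retrieval evaluation",
    3: "machine translation of queries",
}

N = len(docs)
tf = {d: Counter(text.split()) for d, text in docs.items()}
df = Counter(term for counts in tf.values() for term in counts)
idf = {term: math.log(N / df[term]) for term in df}

def weight(counts):
    """tf.idf weights for a bag of tokens, over the collection vocabulary."""
    return {t: counts[t] * idf.get(t, 0.0) for t in counts}

def cosine(a, b):
    """Cosine of the angle between two sparse weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

query = weight(Counter("information retrieval".split()))
ranking = sorted(((cosine(query, weight(tf[d])), d) for d in docs), reverse=True)
print(ranking)  # best-matching documents first, independent of their length
```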

  33. Real probabilities would allow easy thresholding, which is important, e.g., for cutting off result lists at a certain probability level; or for filtering tasks, where documents would only be returned when exceeding the threshold.

  34. Compare this to the vector space model, where the orthogonality of the axes that represent the features implicitly leads to the same assumption.

  35. Sometimes incorrectly referred to as ‘Okapi weighting’, after the name of the retrieval system that originally implemented the scheme.

  36. Also available online at http://nlp.stanford.edu/IR-book/information-retrieval-book.html

  37. See http://trec.nist.gov/

References

  • Abdou S, Savoy J (2006) Statistical and comparative evaluation of various indexing and search models. In: Proc. AIRS 2006. Springer-Verlag LNCS 4182: 362–373

  • Amati G, Rijsbergen CJV (2002) Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst. 20(4): 357–389

  • Baeza-Yates R, Ribeiro-Neto B (2011) Modern information retrieval, 2nd edn. ACM Press, New York

  • Braschler M, Ripplinger B (2004) How effective is stemming and decompounding for German text retrieval? Inf. Retr. 7(3–4): 291–316

  • Braschler M, Gonzalo J (2009) Best practices in system and user-oriented multilingual information access. TrebleCLEF Project: http://www.trebleclef.eu/

  • Braschler M, Peters C (2004) Cross-language evaluation forum: objectives, results, achievements. Inf. Retr. 7(1–2): 7–31

  • Brin S, Page L (1998) The anatomy of a large-scale hypertextual Web search engine. Comput. Netw. ISDN Syst. 30(1–7): 107–117

  • Broder AZ (2000) Identifying and filtering near-duplicate documents. In: Proc. 11th Annual Symposium on Combinatorial Pattern Matching (COM ’00). Springer-Verlag, London: 1–10

  • Chowdhury GG (2010) Introduction to modern information retrieval. Neal-Schuman Publishers

  • Croft B, Metzler D, Strohman T (2009) Search engines: information retrieval in practice. Addison Wesley

  • Dolamic L, Savoy J (2009) When stopword lists make the difference. J. of the Am. Soc. for Inf. Sci. 61(1): 200–203

  • Dunning T (1994) Statistical identification of language. Technical Report, CRL Technical Memo MCCS-94-273

  • El-Khair IA (2007) Arabic information retrieval. Annu. Rev. of Inf. Sci. and Technol. 41(1): 505–533

  • Erickson JC (1997) Options for the presentation of multilingual text: use of the Unicode standard. Library Hi Tech 15(3/4): 172–188

  • Harman D (1991) How effective is suffixing? J. of the Am. Soc. for Inf. Sci. 42(1): 7–15

  • Hiemstra D, de Jong F (1999) Disambiguation strategies for cross-language information retrieval. In: Proc. 3rd European Conference on Research and Advanced Technology for Digital Libraries (ECDL ’99). Springer-Verlag, London: 274–293

  • Hull DA (1996) Stemming algorithms – a case study for detailed evaluation. J. of the Am. Soc. for Inf. Sci. 47(1): 70–84

  • Kaszkiel M, Zobel J (1997) Passage retrieval revisited. In: Proc. ACM SIGIR conference on research and development in information retrieval (SIGIR ’97). ACM, New York: 178–185

  • Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press

  • McNamee P (2009) JHU ad hoc experiments at CLEF 2008. In: Proc. 9th Workshop of the Cross-Language Evaluation Forum (CLEF 2008). Springer-Verlag LNCS 5706: 170–177

  • McNamee P, Mayfield J (2003) JHU/APL experiments in tokenization and non-word translation. In: Proc. 4th Workshop of the Cross-Language Evaluation Forum (CLEF 2003). Springer-Verlag LNCS 3237: 85–97

  • Mitra M, Buckley C, Singhal A, Cardie C (1997) An analysis of statistical and syntactic phrases. In: Proc. 5th International Conference in Computer-Assisted Information Retrieval (RIAO 1997): 200–214

  • Ponte J, Croft B (1998) A language modeling approach to information retrieval. In: Proc. ACM SIGIR conference on research and development in information retrieval (SIGIR ’98). ACM, New York: 275–281

  • Porter MF (1980) An algorithm for suffix stripping. Program 14(3): 130–137. Reprinted in: Spärck Jones K, Willett P (eds.) Readings in Information Retrieval. Morgan Kaufmann Publishers, San Francisco: 313–316

  • Qiu Y, Frei H-P (1993) Concept based query expansion. In: Proc. ACM SIGIR conference on research and development in information retrieval (SIGIR ’93). ACM, New York: 160–169

  • Robertson SE (1977) The probability ranking principle in IR. J. of Doc. 33(4): 294–304

  • Robertson SE, van Rijsbergen CJ, Porter MF (1980) Probabilistic models of indexing and searching. In: Proc. ACM SIGIR conference on research and development in information retrieval (SIGIR ’80). ACM, New York: 35–56

  • Robertson SE, Maron ME, Cooper WS (1982) Probability of relevance: a unification of two competing models for document retrieval. Info. Tech: R and D. 1: 1–21

  • Salton G, Wong A, Yang C (1975) A vector space model for automatic indexing. Commun. of the ACM 18(11): 613–620

  • Salton G, Buckley C (1990) Improving retrieval performance by relevance feedback. J. of the Am. Soc. for Inf. Sci. 41(4): 288–297

  • Savoy J (2005) Comparative study of monolingual and multilingual search models for use with Asian languages. ACM Trans. on Asian Lang. Inf. Proc. 42(2): 163–189

  • Schäuble P (1997) Multimedia information retrieval. Content-based information retrieval from large text and audio databases. Kluwer Academic Publishers

  • Sedgewick R, Wayne K (2011) Algorithms, 4th edn, Section 3.4 “Hash Tables”. Addison-Wesley Professional

  • Singhal A, Buckley C, Mitra M (1996) Pivoted document length normalization. In: Proc. ACM SIGIR conference on research and development in information retrieval (SIGIR ’96). ACM, New York: 21–29

  • Spärck Jones K, Willett P (1997) Readings in information retrieval. Morgan Kaufmann

  • Tague-Sutcliffe J (ed.) (1996) Evaluation of information retrieval systems. J. of the Am. Soc. for Inf. Sci. 47(1)

  • Ubiquity (2002) Talking with Terry Winograd. Ubiquity Magazine 3(23)

  • Walker S, Robertson SE, Boughanem M, Jones GJF, Spärck Jones K (1998) Okapi at TREC-6, automatic ad hoc, VLC, routing, filtering and QSDR. In: Voorhees EM, Harman DK (eds.) The Sixth Text REtrieval Conference (TREC-6). NIST Special Publication 500-240: 125–136


Author information

Corresponding author: Carol Peters.


Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Peters, C., Braschler, M., Clough, P. (2012). Within-Language Information Retrieval. In: Multilingual Information Retrieval. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23008-0_2


  • DOI: https://doi.org/10.1007/978-3-642-23008-0_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-23007-3

  • Online ISBN: 978-3-642-23008-0

