
Within-Language Information Retrieval


Abstract

The information retrieval system stands at the core of many information acquisition cycles. Its task is the retrieval of relevant information from document collections in response to a coded query based on an information need. In its general form, when searching unstructured, natural language text produced by a large range of authors, this is a difficult task: in such text there are many different valid ways to convey the same information. Adding to the complexity of the task is an often incomplete understanding of the desired information by the user. In this chapter, we discuss the mechanisms employed for matching queries and (textual) documents within one language, covering some of the peculiarities of a number of widely spoken languages. Effective within-language retrieval is an essential prerequisite for effective multilingual information access. The discussion of within-language information retrieval or monolingual information retrieval can be structured into two main phases: the indexing phase, commonly implemented as a pipeline of indexing steps, producing a representation that is suitable for matching; and the matching phase, which operates on the indexed representations and produces a ranked list of documents that are most likely to satisfy the user’s underlying information need.
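
To make these two phases concrete, here is a minimal sketch (a toy illustration in Python with invented documents, not the implementation discussed in this chapter): an indexing function that reduces each document to a bag of tokens, and a matching function that ranks documents by simple term overlap with the query.

```python
# Toy two-phase retrieval sketch: indexing produces bag-of-token
# representations, matching ranks documents by query-term overlap.
# All names and documents are invented for illustration.

def build_index(documents):
    """Indexing phase: map each document id to its bag of tokens."""
    return {doc_id: set(text.lower().split()) for doc_id, text in documents.items()}

def rank(query, index):
    """Matching phase: score documents by overlap with the query tokens."""
    query_tokens = set(query.lower().split())
    scores = {doc_id: len(query_tokens & tokens) for doc_id, tokens in index.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

documents = {
    1: "Information retrieval from unstructured text collections",
    2: "Database management systems store structured data",
}
print(rank("text retrieval", build_index(documents)))  # [(1, 2), (2, 0)]
```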


Notes

  1. This is what distinguishes information retrieval systems from Database Management Systems (DBMS), where correctly structured data is stored according to a well-defined database schema and where users select subsets of this data according to exactly specified criteria.

  2. This issue is especially important for comparatively ‘smaller’ document collections. The World Wide Web is a special case in that its massive size and the high redundancy of its content help with locating some relevant documents for nearly all queries. In this sense, searching smaller collections is often harder: the user’s task of picking good search terms is considerably more complex, and the IR system should thus be able to match a wider range of possible query formulations.

  3. One likely reason for the use of very short queries is the prevalent use of web search services such as Google and Yahoo, which by default narrow search results by applying an implicit ‘AND’ operator to the query terms, as illustrated below.
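
A hypothetical illustration of such an implicit ‘AND’ (with an invented term-to-document mapping): a document is returned only if it contains every query term, which quickly narrows the result set as terms are added.

```python
# Invented postings: term -> set of document ids containing the term.
postings = {
    "solar":  {1, 4, 7},
    "energy": {4, 7, 9},
    "policy": {2, 4},
}

query_terms = ["solar", "energy", "policy"]
# Implicit AND: intersect the postings of all query terms.
matches = set.intersection(*(postings.get(t, set()) for t in query_terms))
print(matches)  # {4} - only documents containing every term are returned
```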

  4. We discuss the concept of relevance and its implications in more detail in Chapter 5.

  5. One of the early definitions of the ‘document retrieval problem’ is given by Robertson et al. (1982): “… the function of a document retrieval system is to retrieve all and only those documents that the inquiring patron wants (or would want).”

  6. As could be done with a database query.

  7. Defined as: “If a reference retrieval system’s response to each request is a ranking of the documents in the collections in order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data.” (Robertson 1977).

  8. This is essentially an arbitrary data structure, built for efficient access to retrievable items. Details on the implementation of an index for an IR system are given in Section 2.5.2.

  9. This is in contrast to databases, where every effort is made to store data in a normalised form.

  10. Unicode is the current de facto standard for character encoding of written text in applications that need to cover many diverse languages (www.unicode.org). XML is a text format intended for the creation of structured (text) documents that are machine-readable (http://www.w3.org/XML/).

  11. For more details, see e.g. Erickson (1997).

  12. The checksum is typically a numerical value that acts as a ‘summary’ of the underlying data and that is sensitive to (slight) changes therein. While two random pieces of text may share the same checksum, it is highly improbable that a text would still generate the same checksum after modification when an appropriate algorithm is used for the checksum calculation. Exact duplicates of documents will share the same checksum; a small illustration follows below.
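
A small example of checksum-based detection of exact duplicates, here using a cryptographic hash from Python’s standard library as one possible (assumed) choice of checksum algorithm:

```python
import hashlib

def checksum(text: str) -> str:
    """Checksum acting as a compact 'summary' of the text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

a = "The quick brown fox jumps over the lazy dog."
b = "The quick brown fox jumps over the lazy dog."   # exact duplicate of a
c = "The quick brown fox jumps over the lazy dog!"   # slightly modified

print(checksum(a) == checksum(b))  # True: exact duplicates share the checksum
print(checksum(a) == checksum(c))  # False: even a small edit changes it
```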

  13. Conceptually, every document needs to be compared to all other documents in the collection.

  14. A short, usually fixed-length, representation of the document is used that, unlike checksums, allows for similarity comparisons; one possible construction is sketched below.
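
One way to obtain such a representation, in the spirit of Broder (2000), is to hash overlapping character shingles and keep only the smallest hash values as a fingerprint. The sketch below is a simplified, assumed variant, not the exact scheme from the literature.

```python
def fingerprint(text, shingle_len=5, k=50):
    """Keep the k smallest hashes of overlapping character shingles."""
    shingles = {text[i:i + shingle_len] for i in range(len(text) - shingle_len + 1)}
    return frozenset(sorted(hash(s) for s in shingles)[:k])

def similarity(fp_a, fp_b):
    """Jaccard overlap of two fingerprints, approximating text similarity."""
    return len(fp_a & fp_b) / max(len(fp_a | fp_b), 1)

doc1 = "information retrieval from unstructured text collections"
doc2 = "information retrieval from very large text collections"
# Noticeably higher for near-duplicate texts than for unrelated ones.
print(similarity(fingerprint(doc1), fingerprint(doc2)))
```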

  15. That is, the text underlying a hyperlink that is clicked by the user to follow the link.

  16. While structurally the same as the character n-grams used for Latin-script text in Step 5, they are named differently to account for the different role of Chinese characters; a small example follows below.
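
A small example of such overlapping character bigram indexing for a Chinese string (the example string is mine, chosen purely for illustration):

```python
def char_bigrams(text):
    """Overlapping character bigrams, used where word boundaries are not marked."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

print(char_bigrams("信息检索系统"))
# ['信息', '息检', '检索', '索系', '系统']
```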

  17. This argument is obviously not without problems, as very frequent words such as ‘not’ can easily invert the meaning of a sentence. However, the weighting approaches discussed in Section 2.5 are based mainly on frequency counts and do not employ deeper semantic analysis of the text.

  18. Adaptations of the Porter stemmer to many different languages are available today.

  19. In particular, there is no attempt to detect names and foreign words, which are thus potentially stemmed as well should they match the rules; see the sketch below.
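
An illustrative use of the Porter stemmer and one of its language adaptations, here via the NLTK library (an assumed choice of toolkit, not one prescribed by this chapter); note how a token that happens to be a proper name is stemmed like any other word:

```python
from nltk.stem import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()
# The name 'Gates' matches the suffix rules and is stemmed as well.
print([porter.stem(w) for w in ["connection", "connected", "gates"]])
# ['connect', 'connect', 'gate']

# One of the many available adaptations: a German stemmer.
german = SnowballStemmer("german")
print(german.stem("häuser"))  # 'haus'
```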

  20. Not to be confused with the word n-grams used to solve the segmentation problem in East Asian languages, see Step 4.

  21. Dutch, English, Finnish, French, German, Italian, Spanish, and Swedish.

  22. While ‘bag of words’ is the customary term for describing the approach, ‘bag of tokens’ would be the more helpful choice, as it indicates the nature of the units contained in the indexed representation.

  23. These implications are tied to the usual assumption that all indexing features that have corresponding tokens in the bag are independent of each other – an assumption that is clearly violated in the case of phrasal expressions.

  24. The freely available and often used SMART stopword list is rather exhaustive, containing entries such as ‘need’. It thus nicely illustrates the issues with stopword elimination discussed in Section 2.4.4 by suppressing the final word of the abstract in the indexed representation; see the example below.
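
A toy example of stopword elimination (with a tiny invented stopword list rather than the SMART list itself), showing how a potentially meaningful token such as ‘need’ disappears from the indexed representation:

```python
# Invented miniature stopword list that happens to contain 'need'.
stopwords = {"the", "a", "of", "to", "is", "need"}

tokens = ["the", "user's", "underlying", "information", "need"]
indexed = [t for t in tokens if t not in stopwords]
print(indexed)  # ["user's", 'underlying', 'information'] - 'need' is suppressed
```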

  25. Whether the text sample is actually relevant to the user’s query is a different question, one that is only answerable if the underlying information need, user preferences, etc. are known.

  26. A data structure that allows items to be located at a ‘cost’ that is typically independent of the size of the structure (Sedgewick and Wayne 2011); a minimal sketch follows below.
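
A minimal sketch, using a Python dict (a hash table) as the structure in question: each indexing feature maps to its postings, so a query term can be looked up in expected constant time regardless of vocabulary size. The postings shown are invented.

```python
# term -> list of (document id, term frequency) pairs; contents are invented
inverted_index = {
    "retrieval": [(1, 3), (4, 1)],
    "language":  [(2, 2), (4, 2)],
}

postings = inverted_index.get("retrieval", [])  # expected O(1) lookup
print(postings)  # [(1, 3), (4, 1)]
```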

  27. Derived from m original search terms, where potentially m ≠ n.

  28. Also typically implemented in the form of a hash table.

  29. A wildcard character is a placeholder for one or more characters that are unknown or unspecified; see the illustration below.
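
A hypothetical illustration: the wildcard query term ‘retriev*’ is expanded against the indexing vocabulary before matching, here using pattern matching from Python’s standard library.

```python
from fnmatch import fnmatch

vocabulary = ["retrieval", "retrieve", "retriever", "relevance"]
expanded = [term for term in vocabulary if fnmatch(term, "retriev*")]
print(expanded)  # ['retrieval', 'retrieve', 'retriever']
```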

  30. Complexities arise when one considers the possibility of queries containing features that do not occur in any document of the collection. Theoretically, the vector space is extended by those features (i.e., the vector space represents an ‘indexing vocabulary’ that extends beyond only those features present in the collection), but this case can be ignored for the discussion of the basic workings of the model as presented here.

  31. The use of ‘term frequency’ actually gives rise to the common name for this weighting scheme, tf.idf-Cosine, although ff.idf-Cosine (‘feature frequency’) would be equally justified, especially when considering non-textual information.

  32. The angle will be small independently of the length of the document. Since queries are typically much shorter than documents, this is a desirable property; a worked sketch of tf.idf weighting and cosine matching follows below.
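
A compact sketch of tf.idf weighting combined with cosine matching, under the simplest possible choices (raw term frequency, logarithmic idf, no smoothing); real systems use many variants of these formulas, and the toy documents are invented.

```python
import math
from collections import Counter

docs = {
    1: "cross language information retrieval",
    2: "information retrieval evaluation",
    3: "machine translation of queries",
}

N = len(docs)
tf = {d: Counter(text.split()) for d, text in docs.items()}
df = Counter(term for counts in tf.values() for term in counts)
idf = {term: math.log(N / df[term]) for term in df}

def weight(counts):
    """tf.idf weights for a bag of tokens, over the collection vocabulary."""
    return {t: counts[t] * idf.get(t, 0.0) for t in counts}

def cosine(a, b):
    """Cosine of the angle between two sparse weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

query = weight(Counter("information retrieval".split()))
ranking = sorted(((cosine(query, weight(tf[d])), d) for d in docs), reverse=True)
print(ranking)  # best-matching documents first, independent of their length
```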

  33. Real probabilities would allow easy thresholding, which is important, e.g., for cutting off result lists at a certain probability level; or for filtering tasks, where documents would only be returned when exceeding the threshold.

  34. Compare this to the vector space model, where the orthogonality of the axes that represent the features implicitly leads to the same assumption.

  35. Sometimes incorrectly referred to as ‘Okapi weighting’, after the name of the retrieval system that originally implemented the scheme.

  36. Also available online at http://nlp.stanford.edu/IR-book/information-retrieval-book.html

  37. See http://trec.nist.gov/

References

  • Abdou S, Savoy J (2006) Statistical and comparative evaluation of various indexing and search models. In: Proc. AIRS 2006. Springer-Verlag LNCS 4182: 362–373

  • Amati G, Rijsbergen CJV (2002) Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. Inf. Syst. 20(4): 357–389

  • Baeza-Yates R, Ribeiro-Neto B (2011) Modern information retrieval, 2nd edn. ACM Press, New York

  • Braschler M, Ripplinger B (2004) How effective is stemming and decompounding for German text retrieval? Inf. Retr. 7(3–4): 291–316

  • Braschler M, Gonzalo J (2009) Best practices in system and user-oriented multilingual information access. TrebleCLEF Project: http://www.trebleclef.eu/

  • Braschler M, Peters C (2004) Cross-language evaluation forum: objectives, results, achievements. Inf. Retr. 7(1–2): 7–31

  • Brin S, Page L (1998) The anatomy of a large-scale hypertextual Web search engine. Comput. Netw. ISDN Syst. 30(1–7): 107–117

  • Broder AZ (2000) Identifying and filtering near-duplicate documents. In: Proc. 11th Annual Symposium on Combinatorial Pattern Matching (COM ’00). Springer-Verlag, London: 1–10

  • Chowdhury GG (2010) Introduction to modern information retrieval. Neal-Schuman Publishers

  • Croft B, Metzler D, Strohman T (2009) Search engines: information retrieval in practice. Addison Wesley

  • Dolamic L, Savoy J (2009) When stopword lists make the difference. J. of the Am. Soc. for Inf. Sci. 61(1): 200–203

  • Dunning T (1994) Statistical identification of language. Technical Report, CRL Technical Memo MCCS-94-273

  • El-Khair IA (2007) Arabic information retrieval. Annu. Rev. of Inf. Sci. and Technol. 41(1): 505–533

  • Erickson JC (1997) Options for the presentation of multilingual text: use of the Unicode standard. Library Hi Tech 15(3/4): 172–188

  • Harman D (1991) How effective is suffixing? J. of the Am. Soc. for Inf. Sci. 42(1): 7–15

  • Hiemstra D, de Jong F (1999) Disambiguation strategies for cross-language information retrieval. In: Proc. 3rd European Conference on Research and Advanced Technology for Digital Libraries (ECDL ’99). Springer-Verlag, London: 274–293

  • Hull DA (1996) Stemming algorithms – a case study for detailed evaluation. J. of the Am. Soc. for Inf. Sci. 47(1): 70–84

  • Kaszkiel M, Zobel J (1997) Passage retrieval revisited. In: Proc. ACM SIGIR conference on research and development in information retrieval (SIGIR ’97). ACM, New York: 178–185

  • Manning CD, Raghavan P, Schütze H (2008) Introduction to information retrieval. Cambridge University Press

  • McNamee P (2009) JHU ad hoc experiments at CLEF 2008. In: Proc. 9th Workshop of the Cross-Language Evaluation Forum (CLEF 2008). Springer-Verlag LNCS 5706: 170–177

  • McNamee P, Mayfield J (2003) JHU/APL experiments in tokenization and non-word translation. In: Proc. 4th Workshop of the Cross-Language Evaluation Forum (CLEF 2003). Springer-Verlag LNCS 3237: 85–97

  • Mitra M, Buckley C, Singhal A, Cardie C (1997) An analysis of statistical and syntactic phrases. In: Proc. 5th International Conference in Computer-Assisted Information Retrieval (RIAO 1997): 200–214

  • Ponte J, Croft B (1998) A language modeling approach to information retrieval. In: Proc. ACM SIGIR conference on research and development in information retrieval (SIGIR ’98). ACM, New York: 275–281

  • Porter MF (1980) An algorithm for suffix stripping. Program 14(3): 130–137. Reprinted in: Spärck Jones K, Willett P (eds.) Readings in Information Retrieval. Morgan Kaufmann Publishers, San Francisco: 313–316

  • Qiu Y, Frei H-P (1993) Concept based query expansion. In: Proc. ACM SIGIR conference on research and development in information retrieval (SIGIR ’93). ACM, New York: 160–169

  • Robertson SE (1977) The probability ranking principle in IR. J. of Doc. 33(4): 294–304

  • Robertson SE, van Rijsbergen CJ, Porter MF (1980) Probabilistic models of indexing and searching. In: Proc. ACM SIGIR conference on research and development in information retrieval (SIGIR ’80). ACM, New York: 35–56

  • Robertson SE, Maron ME, Cooper WS (1982) Probability of relevance: a unification of two competing models for document retrieval. Info. Tech: R and D. 1: 1–21

  • Salton G, Wong A, Yang C (1975) A vector space model for automatic indexing. Commun. of the ACM 18(11): 613–620

  • Salton G, Buckley C (1990) Improving retrieval performance by relevance feedback. J. of the Am. Soc. for Inf. Sci. 41(4): 288–297

  • Savoy J (2005) Comparative study of monolingual and multilingual search models for use with Asian languages. ACM Trans. on Asian Lang. Inf. Proc. 42(2): 163–189

  • Schäuble P (1997) Multimedia information retrieval. Content-based information retrieval from large text and audio databases. Kluwer Academic Publishers

  • Sedgewick R, Wayne K (2011) Algorithms, 4th edn, Section 3.4 “Hash Tables”. Addison-Wesley Professional

  • Singhal A, Buckley C, Mitra M (1996) Pivoted document length normalization. In: Proc. ACM SIGIR conference on research and development in information retrieval (SIGIR ’96). ACM, New York: 21–29

  • Spärck Jones K, Willett P (1997) Readings in information retrieval. Morgan Kaufmann

  • Tague-Sutcliffe J (ed.) (1996) Evaluation of information retrieval systems. J. of the Am. Soc. for Inf. Sci. 47(1)

  • Ubiquity (2002) Talking with Terry Winograd. Ubiquity Magazine 3(23)

  • Walker S, Robertson SE, Boughanem M, Jones GJF, Spärck Jones K (1998) Okapi at TREC-6, automatic ad hoc, VLC, routing, filtering and QSDR. In: Voorhees EM, Harman DK (eds.) The Sixth Text REtrieval Conference (TREC-6). NIST Special Publication 500-240: 125–136


Author information

Corresponding author: Carol Peters.


Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Peters, C., Braschler, M., Clough, P. (2012). Within-Language Information Retrieval. In: Multilingual Information Retrieval. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23008-0_2


  • DOI: https://doi.org/10.1007/978-3-642-23008-0_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-23007-3

  • Online ISBN: 978-3-642-23008-0

