
1 Introduction

Previous studies on document information retrieval were based on the bag-of-words concept for document representation (e.g., [1,2,3,4]), which takes into consideration the term frequency (tf) and inverse document frequency (idf). The major drawback of this concept is that it ignores term proximity, i.e., the closeness of the query terms appearing in a document.

To improve document retrieval performance, much recent research has introduced models that consider the proximity among query terms in documents. These methods use term positional information to compute the document score. Intuitively, document \(d_1\) is ranked higher than document \(d_2\) when the query terms occur close together in \(d_1\) but far apart in \(d_2\). Proximity can be seen as a kind of indirect measure of term dependence [5].

Even though document ranking has been improved by recent research (e.g. [6,7,8,9,10,11]), such proximity methods may increase both time complexity and space complexity, because they must perform a huge number of term-position comparisons and store the results of these comparisons. More recently, many studies have used a term/word signal (time series) representation. A term signal represents a query term in a document and is transformed from the time domain to the frequency domain using well-known transformation techniques such as the Fourier transform [12,13,14,15], the cosine transform [16], and the wavelet transform [17, 18].

By using the DWT as a filter bank and a multi-resolution analysis, we are able to analyze a document through multiple glasses of concentration. One glass analyzes a document as a whole, which is similar to the vector space method: using this glass or lens is like using a telescope to view the entire document, and the glass does not need to be moved over the details (such as the proximity among query terms in a specific line or paragraph) of the document. Another glass, such as an opera glass, may focus on the first half of a document and needs to be moved over all details in that region. A final example, a microscope glass, may focus on the third line of a document and likewise needs to be moved over all details in that line.

This chapter presents an information retrieval framework based on the DWT, using both the Haar and the Daubechies wavelet transform algorithms. The framework achieves the speed of classical methods such as the vector space method together with the benefits of proximity methods, providing a high-quality information retrieval system. Retrieval is performed as follows: (a) locating the appearances of the query terms in each document, (b) transforming this positional information into a time-scale representation, and (c) analyzing the time scale with the help of the DWT.

Information retrieval systems based on the DWT have many advantages over classical information retrieval systems, but using fixed document segmentation has disadvantages because it neglects the semantic representation of documents. The main research hypothesis of this chapter is the necessity of using spectral analysis based on different types of semantic segmentation, namely sentence-based segmentation, paragraph-based segmentation and fixed-length segmentation, to improve the document score.

This chapter proceeds as follows: Sect. 2 presents the related background, Sect. 3 introduces the most important issues in the design and implementation of information retrieval using the DWT, Sect. 4 reports the experiments and results on the Text REtrieval Conference (TREC) collection, and finally the conclusion is given in Sect. 5.

2 Background

In this section, the major related concepts are reviewed.

2.1 Term Signal

The notion of a term signal was introduced by [14] before being reintroduced by [19] as a word signal. It is a term vector representation that records a term's frequency of appearance in particular bins or segments within a document; the term signal thus shows how a query term is spread over the document. If a term's appearances in a document can be thought of as a signal, then this signal can be transformed using a transformation algorithm such as the Fourier transform or the wavelet transform and compared with other signals. A term signal carries representative information about a term in the time domain, since it describes where the term occurs and its frequency in each region. Applying this approach, a set of term signals is computed for each document in the dataset, each signal being a sequence of integers giving the term frequency in a specific region of the document. In document d, the signal of a query term t is represented as:

$$\begin{aligned} s(t, d) = \tilde{f}_{t,d} = [f_{t,1,d}, f_{t,2,d}, ..., f_{t,B,d}], \end{aligned}$$
(1)

where \(f_{t,b,d}\) is the frequency of term t in segment b of document d, for \(1 \le b \le B\).
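For illustration, the following is a minimal sketch of computing a term signal in Python (the function name and the tokenized input are our assumptions, not the chapter's code):

def term_signal(term, doc_tokens, B):
    # Eq. (1): count the occurrences of `term` in each of the B
    # equal-width segments of a tokenized document.
    signal = [0] * B
    n = len(doc_tokens)
    for pos, token in enumerate(doc_tokens):
        if token == term:
            signal[pos * B // n] += 1  # map word position to segment index
    return signal

print(term_signal('wavelet', ['a', 'wavelet', 'is', 'a', 'wavelet', 'tool'], 2))
# -> [1, 1]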

2.2 Weighting Scheme

Zobel and Moffat [20] provided a more general structure of the form AB-CDE-FGH to accommodate the many different scoring functions that had appeared. Their study compared 720 different weighting schemes on the TREC document set, and their results showed that the BD-ACI-BCA method was the overall best performer. The computed segment components are therefore weighted using the BD-ACI-BCA weighting scheme as follows:

$$\begin{aligned} \omega _{t,b,d} = \frac{1+\log _{e} f_{t,b,d}}{(1-slp)+slp\,\frac{W_{d}}{\mathrm {ave}_{d \in D}W_{d}}}, \end{aligned}$$
(2)
where slp is the slope parameter of the pivoted document-length normalization, \(W_{d}\) is the weight (length) of document d, and \(\mathrm {ave}_{d \in D}W_{d}\) is the average document weight over the collection D.

The term signal then corresponds to the following weighted term signal:

$$\begin{aligned} \tilde{\omega }_{t,d} = [\omega _{t,1,d}, \omega _{t,2,d}, ..., \omega _{t,B,d}]. \end{aligned}$$
(3)
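As a sketch, Eq. (2) can be transcribed directly into Python (the parameter names are ours; the value of the slope slp is left to the caller):

import math

def segment_weight(f_tbd, slp, W_d, avg_W):
    # Eq. (2): log-damped term frequency divided by the pivoted
    # document-length normalizer. Only called for f_tbd > 0.
    return (1.0 + math.log(f_tbd)) / ((1.0 - slp) + slp * W_d / avg_W)

print(segment_weight(3, 0.7, 120.0, 100.0))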

2.3 Document Segmentation

Document segmentation is a very important issue in information retrieval, especially when using a transformation algorithm such as the DWT, because it determines the distance within which two query terms in a document are considered to be in the same region (bin, line, paragraph, etc.). Three segmentation schemes were introduced in [21]:

  • Sentence-based segmentation.

  • Paragraph-based segmentation.

  • Fixed length segmentation.

Before diving into the details of segmentation, one should understand that if the query terms are found in the same segment/bin, the wavelet transform algorithm should give a higher score than if they are scattered over the document; that is why segmentation is such an important issue. Sentence-based segmentation and paragraph-based segmentation were first investigated by [21] to give a real meaning to proximity. One limitation of the wavelet transform algorithm is that it requires the number of segments (bins may be used as a synonym) to be exactly \(2^n\), where \(n \in Z\). No zero-padding process is needed for the third item, fixed-length segmentation, because the number of segments used is already of the form \(2^n\).

The number of segments to use with fixed-length segmentation depends heavily on the nature of the dataset. Documents of different lengths require different segment sizes. For instance, datasets such as chats, reviews or microblogs require a very small number of bins (4, 8 or 16) [22], while a dataset such as TREC requires a larger number of bins (8, 16 or 32), as used in [21] as well as in [23].

Another problem that depends on the nature of the dataset is the definition of a document: is the document the whole chat or a single message? In a dataset such as the Holy Quran, a document may be a verse, a chapter, a related topic (a sequence of verses) or a page [24].

The problem with sentence-based segmentation and paragraph-based segmentation is that the number of segments is not necessarily of the form \(2^n\). In this case, a zero-padding process is needed. Zero padding refers to appending zeros to the end of a term signal to increase its length; it is required whenever the number of segments (based on sentences or paragraphs) falls between \(2^{n-1}\) and \(2^n\), where \(n \in Z\). Much of the inaccuracy in the suggested information retrieval model stems from this zero padding of term signals.
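A minimal sketch of this padding step (the function name is ours):

def zero_pad(signal):
    # Pad a term signal with zeros up to the next power of two,
    # as required by the wavelet transform.
    n = 1
    while n < len(signal):
        n *= 2
    return signal + [0] * (n - len(signal))

print(zero_pad([2, 0, 1]))  # three sentence bins -> [2, 0, 1, 0]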

There are two different behaviors regarding zero padding from Haar and Daubechies wavelet transform algorithms:

  • The Haar basis functions are discontinuous and are therefore not successful at smoothing the zero padding.

  • Daubechies supports orthogonal wavelets with a preassigned degree of smoothness.

2.4 Wavelet Transform Algorithm

The wavelet transform has been used in many domains, such as image processing, edge detection, data compression, image compression, video compression, audio compression, compression of electroencephalography (EEG) signals, compression of electrocardiograph (ECG) signals, denoising and filtering, and finally information retrieval and text mining. The wavelet transform has the ability to decompose natural signals into scaled and shifted versions of a mother wavelet [25]. Wavelets are defined by the wavelet function \(\psi (t)\) and the scaling function \(\varphi (t)\) in the time domain. The wavelet function is described as \(\psi \in \mathbf {L}^2(\mathbb {R})\) (where \(\mathbf {L}^2(\mathbb {R})\) is the set of functions f(t) which satisfy \(\int \mid f(t)\mid ^2 dt < \infty \)) with a zero average and a norm of 1. A wavelet can be scaled and translated by adjusting the parameters s and u, respectively:

$$\begin{aligned} \psi _{u,s}(t) = \frac{1}{\sqrt{s}}\psi \left( \frac{t-u}{s}\right) . \end{aligned}$$
(4)

The factor \(\frac{1}{\sqrt{s}}\) keeps the norm equal to one for all s and u. The wavelet transform of \(f\in \mathbf {L}^2(\mathbb {R})\) at time u and scale s is:

$$\begin{aligned} Wf(u,s) =\int _{-\infty }^{+\infty } f(t)\frac{1}{\sqrt{s}}\psi ^* \left( \frac{t-u}{s}\right) dt, \end{aligned}$$
(5)

where s and \(u \in \mathbf {Z}\) and \(\psi ^*\) is the complex conjugate of \(\psi \) [17].

The scaling function is described as \(\varphi _{u,s}(t) \in V_n\), where the approximation spaces \(V_n\) form a nested sequence:

$$ \{0\} \leftarrow \cdots \subset V_{n+1} \subset V_n \subset V_{n-1} \subset \cdots \rightarrow \mathbf {L}^2(\mathbb {R}). $$

The tree-structured filter bank algorithm can be used to compute the DWT. The signal f(t) is transformed by the DWT filter bank tree as follows:

$$\begin{aligned} \begin{array}{l} f \xrightarrow {DWT^1} A^1 + D^1\\ \;\;\, \xrightarrow {DWT^2} A^2 + D^2 + D^1\\ \;\;\, \xrightarrow {DWT^3} A^3 + D^3 + D^2 + D^1\\ \;\;\,\quad \vdots \\ \;\;\, \xrightarrow {DWT^s} A^s + D^s + D^{s-1} +\cdots + D^1 \end{array} \end{aligned}$$
(6)

where \(A^s\) and \(D^s\) correspond to the approximation sub-signal and the detail sub-signal, respectively, at the sth level of the DWT of the signal f(t).
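For illustration, this multi-level decomposition can be reproduced with the third-party PyWavelets package (an assumption for illustration only; the chapter's own implementation does not use it):

import pywt

# Three-level Haar DWT of an 8-sample term signal: wavedec returns
# [A3, D3, D2, D1], matching the last row of Eq. (6) with s = 3.
coeffs = pywt.wavedec([1, 2, 3, 0, 0, 0, 0, 0], 'haar', level=3)
for name, c in zip(['A3', 'D3', 'D2', 'D1'], coeffs):
    print(name, c)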

2.4.1 Haar Wavelet Transform

One of the simplest wavelet transforms is the Haar transform [26], which is derived from the Haar matrix. It provides different levels of resolution of a signal: at each resolution, the positions of the terms are shown by the transformed signal. The Haar transform takes in every possible scaled and shifted version of the Haar wavelet. Given a term signal \(\tilde{f}_{t,d}\) and a number of segments N, the wavelet components are \(H_N\tilde{f}_{t,d}\), where \(H_N\) is the Haar matrix. When \(N=4\) we have

$$ H_4 = \frac{1}{2} \begin{bmatrix} 1&1&1&1 \\ 1&1&-1&-1 \\ \sqrt{2}&-\sqrt{2}&0&0 \\ 0&0&\sqrt{2}&-\sqrt{2} \end{bmatrix} $$
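As a small sketch, \(H_4\) can be applied to a four-segment term signal with NumPy:

import numpy as np

H4 = 0.5 * np.array([[1, 1, 1, 1],
                     [1, 1, -1, -1],
                     [np.sqrt(2), -np.sqrt(2), 0, 0],
                     [0, 0, np.sqrt(2), -np.sqrt(2)]])

f = np.array([2, 0, 1, 0])   # term signal with B = 4 segments
print(np.dot(H4, f))         # the Haar wavelet components of the signal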

2.4.2 Daubechies Wavelet Transform

In this research, we take advantage of the smoothing that the Daubechies wavelet transform applies to the term signal. There are many types of Daubechies transforms; the simplest is the Daub4 wavelet. The Daub4 wavelet transform is defined in the same way as the Haar wavelet transform and differs only in how the scaling functions and wavelets are defined.

The following code computes the Daubechies wavelet transform (Daub4) in Python. The main function to be called is daub4Transform, which takes one parameter, the term signal. The length of the term signal must equal \(2^n\) with \(n \in Z\). The following is an example of calling the function:

print (daub4Transform([1,2,3,0,0,0,0,0]))

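The full listing is rendered as a figure in the published chapter and is not reproduced here. A minimal compatible sketch, assuming the standard Daub4 filter coefficients and periodic (wrap-around) boundary handling, is:

import math

def daub4_step(signal):
    # One level of the Daub4 transform with a periodic boundary.
    n = len(signal)
    s3 = math.sqrt(3.0)
    norm = 4.0 * math.sqrt(2.0)
    h = [(1 + s3) / norm, (3 + s3) / norm, (3 - s3) / norm, (1 - s3) / norm]
    g = [h[3], -h[2], h[1], -h[0]]   # wavelet (detail) filter
    approx, detail = [], []
    for i in range(0, n, 2):
        window = [signal[(i + k) % n] for k in range(4)]
        approx.append(sum(hk * x for hk, x in zip(h, window)))
        detail.append(sum(gk * x for gk, x in zip(g, window)))
    return approx, detail

def daub4Transform(signal):
    # Full pyramid transform: keep filtering the approximation part.
    output = list(signal)
    n = len(output)
    while n >= 4:
        approx, detail = daub4_step(output[:n])
        output[:n] = approx + detail
        n //= 2
    return output

Note that this sketch uses the filter-bank formulation rather than multiplication by a transform matrix, which is the faster option mentioned below.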

The time complexity of the matrix multiplication that realizes this transformation is \(O(N^2)\) for a signal of N elements. To speed up the process, one may instead use the filter bank built from the scaling function and the wavelet function.

3 Design Issues and Implementation of Information Retrieval Using DWT

3.1 Problems and Design Issues

Park [15,16,17] assumes that documents have a fixed length and are unformatted. Using a fixed number of bins for every dataset has several disadvantages:

  1. It does not distinguish between long documents and short documents. Suppose a fixed number of bins B is used, with \(B = 2^i\) and \(i \in \mathbb {N}\). For any document d containing |d| words, each bin b contains \(|b| = \frac{|d|}{B}\) words. This means different documents get different levels of proximity; the number of bins for microblogs [22] should be very small in comparison to the number of bins for scientific theses.

  2. It does not consider the length of the query. If all terms of a query are found in one bin of a document, that document is more related to the query than a document holding the same query terms scattered over more than one bin. To account for this, the suggested bin length should satisfy \(|b| \ge |Q|\).

  3. It does not consider the semantic representation of documents. A paragraph, which is a unit of thought focused on a single topic, may semantically represent a single bin; likewise, a sentence may semantically represent a single bin.

  4. It does not weight matched query terms according to their location in the document. Matched query terms in the first part of a document should receive a greater weight than matched query terms in the middle or at the end of the retrieved documents.

The main hypothesis of this research is that using different types of segmentation and term weighting alongside the DWT improves the document score. The algorithms and techniques are described in the following subsections.

3.2 Implementation of the Suggested Model

The suggested model was implemented using Python 2.7 and the Natural Language Toolkit (NLTK) 3.0, a leading platform for building Python programs that work with natural-language text. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text-processing libraries.
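The chapter does not reproduce its preprocessing code; the following is a minimal sketch of a typical NLTK pipeline, where the choice of tokenization, stop-word removal and Porter stemming is our assumption rather than a transcription of the original implementation:

from nltk.tokenize import word_tokenize      # requires nltk.download('punkt')
from nltk.corpus import stopwords            # requires nltk.download('stopwords')
from nltk.stem import PorterStemmer

def preprocess(text):
    # Tokenize, drop stop words and punctuation, and stem what remains.
    stop = set(stopwords.words('english'))
    stemmer = PorterStemmer()
    tokens = word_tokenize(text.lower())
    return [stemmer.stem(t) for t in tokens if t.isalpha() and t not in stop]

print(preprocess("Wavelets provide a multi-resolution analysis of documents."))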

3.3 Document Segmentation

A document may be segmented by fixed length, by sentences or by paragraphs. If fixed-length segmentation is used, the minimum distance between any occurrence of term \(t_i\) and term \(t_j\) in a document D should be considered. Suppose the document contains C words, d is the maximum allowed distance, S is the number of sentences and P is the number of paragraphs. The main problem is to construct the Haar matrix in these different situations. The following shows how the Haar matrix can be constructed.

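The original algorithm listing appears as a figure in the published chapter. A minimal NumPy sketch that builds the normalized \(N \times N\) Haar matrix by the standard recursion (our assumption about the construction, consistent with the \(H_4\) shown in Sect. 2.4.1) is:

import numpy as np

def haar_matrix(n):
    # Normalized Haar matrix for n a power of two, built by the recursion
    # H_{2m} = (1/sqrt(2)) * [ H_m kron [1, 1] ; I_m kron [1, -1] ].
    assert n >= 1 and (n & (n - 1)) == 0, "n must be a power of two"
    h = np.array([[1.0]])
    while h.shape[0] < n:
        top = np.kron(h, [1.0, 1.0])
        bottom = np.kron(np.eye(h.shape[0]), [1.0, -1.0])
        h = np.vstack([top, bottom]) / np.sqrt(2.0)
    return h

print(haar_matrix(4))   # reproduces the H_4 matrix of Sect. 2.4.1

The order n would be chosen per document: the number of fixed-length bins, the number of sentences S, or the number of paragraphs P, with the term signal zero-padded up to the next power of two.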

3.4 Term Weighting

We hypothesize that significant terms tend to appear at the front of a document. A formatted document may be segmented by document length, sentences, topics/subtopics or paragraphs, while an unformatted document may be segmented by document length, sentences or paragraphs. The minimum distance between any occurrence of term \(t_i\) and term \(t_j\) in a document D should be considered.

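The term-weighting algorithms are likewise rendered as figures in the published chapter. As a sketch of one plausible reading of the position-weight idea, using the additional weights (0.75, 0.50, 0.25) reported in Sect. 4 for the first three paragraphs (the exact combination rule in the original listings may differ):

def position_weight(par_index, extra=(0.75, 0.50, 0.25)):
    # Hypothetical rule: bins from the first three paragraphs receive an
    # additional weight; later paragraphs receive none.
    return 1.0 + (extra[par_index] if par_index < len(extra) else 0.0)

# Scale each component of the weighted term signal of Eq. (3) by the
# position weight of its paragraph:
w = [0.8, 0.5, 0.3, 0.0]
print([round(w_b * position_weight(b), 3) for b, w_b in enumerate(w)])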

4 Experiments and Results

The TREC dataset is used to evaluate the proposed methods. Specifically, the Associated Press disk 2 and Wall Street Journal disk 2 (AP2WSJ2) collections were chosen; the AP2WSJ2 set contains more than 150,000 documents. The selected query set comprises topics 51 to 200 (from TREC 1, 2, and 3). The investigated methods are compared with previous high-precision methods. The experiments are performed with the following segmentation methods:

  • Fixed number of segments (8 bins).

  • Sentence-based segmentation.

  • Paragraph-based segmentation.

  • Fixed number of terms in a segment (\(d = 3\)).

  • Fixed number of terms in a segment (\(d = 10\)).

  • Paragraph-based segmentation with position weight.

These segmentation methods are compared against two other important term-proximity methods, namely shortest-substring retrieval (SSR) [6] and the window-based bi-gram BM25 model [27].

In the experiments, the official TREC evaluation measures have been used, namely Mean Average Precision (MAP) and Average Precision at K (where \(K = \{5,10,15,20\}\)).
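For reference, a minimal sketch of these measures (MAP is the mean of the average precision over all queries):

def precision_at_k(ranked, relevant, k):
    # Fraction of the top-k retrieved documents that are relevant.
    return sum(1 for d in ranked[:k] if d in relevant) / float(k)

def average_precision(ranked, relevant):
    # Mean of precision@k over the ranks at which relevant documents occur.
    hits, total = 0, 0.0
    for k, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / float(k)
    return total / len(relevant) if relevant else 0.0

print(precision_at_k(['d1', 'd2', 'd3'], {'d1', 'd3'}, 2))   # 0.5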

Table 1 shows that most of the suggested segmentation methods do not achieve significant improvements. All suggested segmentation methods outperform the SSR method; however, they do not outperform the fixed-bin segmentation method (8 bins) or the window-based bi-gram BM25 model, while the fixed-bin segmentation method does outperform the window-based bi-gram BM25 model.

Table 1 Comparison of Mean Average Precision (MAP), and Average Precision at K (where \(K = \{5,10,15,20\}\)) for different segmentation methods for DWT using Haar Wavelet Transform, SSR and window based bi-gram BM25 model
Table 2 Comparison of Mean Average Precision (MAP), and Average Precision at K (where \(K = \{5,10,15,20\}\)) for different segmentation methods for DWT using Daubechies (Daub4) Wavelet Transform, SSR and window based bi-gram BM25 model

Because the Haar wavelet transform handles the zero padding of term signals poorly, which may affect the accuracy of the results, the Daubechies wavelet transform is also applied to improve retrieval accuracy.

In the paragraph-based segmentation with position weight method, two different additional weight schemes have been applied to the first three paragraphs: (0.75, 0.50, 0.25) and (0.2, 0.1, 0.05).

Table 2 reports the results of the same suggested segmentation methods when the Daubechies (Daub4) wavelet transform is used. The Daubechies wavelet transform outperforms the Haar wavelet transform in most cases and on most measures. Although the values are hardly distinguishable from those of the Haar wavelet transform in Table 1, there is a slight improvement in the results.

There is also a slight improvement in the results when more smoothing is used for the additional weight schemes applied to the positions of sentences and paragraphs: the scheme (0.2, 0.1, 0.05) gives better results than (0.75, 0.50, 0.25). This confirms previous findings in the literature [28, 29].

5 Conclusion

This chapter has given an account of using the DWT in information retrieval and the reasons for using different document segmentations, and has proposed a variety of segmentation methods to enhance the document score and the accuracy of retrieval. All suggested segmentation methods are based on meaning and context. All suggested segmentation methods outperform the SSR method, while the sentence-based, paragraph-based and paragraph-based with position weight methods outperform the window-based bi-gram BM25 model. However, they did not outperform the fixed-bin segmentation method (8 bins).

This research has examined the use of different wavelet transform algorithms, namely Haar and Daubechies, and has shown that Daubechies slightly outperforms Haar. One of the most important items of future work is to find the optimum number of bins and to elaborate the additional weight scheme for a given dataset.