Abstract
Most information retrieval models represent documents as bags of words, taking into account term frequencies (tf) and inverse document frequencies (idf). However, most of these models ignore the distance among query terms in the documents (i.e. term proximity). Several studies have appeared in recent years that use the proximity among query terms to increase the effectiveness of document retrieval. To address the proximity problem, several systems provide tools to specify term proximity at the query formulation level. They rank documents based on the relative positions of the query terms within the documents, but they must store all proximity data in the index, which increases the index size and slows down the search process. In the last decade, many studies have provided models that use a term signal representation for each query term; the query is transformed from the time domain into the frequency domain using transformation techniques such as the wavelet transform. The Discrete Wavelet Transform (DWT), such as Haar and Daubechies, uses a multiresolution technique by which different frequencies are analyzed with different resolutions. The advantage of the DWT is that it considers the spatial information of the query terms within the document rather than only the term counts. In this chapter, in order to improve the ranking score, improve the run-time efficiency of query resolution, and maintain a reasonable index size, two different discrete wavelet transform algorithms are applied, namely Haar and Daubechies; three different types of spectral analysis based on semantic segmentation are carried out, namely sentence-based segmentation, paragraph-based segmentation and fixed-length segmentation; and different term weighting is performed according to term position. The experiments were conducted using the Text REtrieval Conference (TREC) collection.
1 Introduction
Previous studies on document information retrieval were based on the bag-of-words concept for document representation, e.g. [1,2,3,4], which takes into consideration the tf and idf; the major drawback of this concept is that it ignores term proximity. Proximity represents the closeness of the query terms appearing in a document.
To improve document retrieval performance, many recent studies have provided new models that consider the proximity among query terms in documents. These methods use term positional information to compute the document score. Intuitively, document \(d_1\) is ranked higher than document \(d_2\) when the query terms occur close together in \(d_1\) while they appear far apart in \(d_2\). Proximity can be seen as a kind of indirect measure of term dependence [5].
Although document ranking has been improved in recent research (e.g. [6,7,8,9,10,11]), these methods may increase both time and space complexity because they must make a huge number of comparisons and store the results of these comparisons. More recently, many studies have used term/word signal representations, or time series. A term signal represents a query term in a document, and it is transformed from the time domain to the frequency domain using known transformation techniques such as the Fourier transform [12,13,14,15], the cosine transform [16] and the wavelet transform [17, 18].
By using the DWT, as a filter bank and a multiresolution analysis, we are able to analyze a document through multiple glasses of concentration. One glass analyzes a document as a whole, which is similar to the vector space method; using this glass or lens is like using a telescope to view the entire document, and the glass does not need to be moved over the details (such as the proximity among query terms in a specific line or paragraph) of a document. Another glass, such as an opera glass, may focus on the first half of a document and needs to be moved over all details in this region. A final example, a microscope glass, may focus on the third line of a document and needs to be moved over all details in this line.
This chapter presents an information retrieval framework based on DWT using both Haar and Daubechies wavelet transform algorithms. The framework achieves the speed of the classical methods such as vector space methods with the benefits of the proximity methods to provide a high quality information retrieval system. The retrieval is performed by the following: (a) locating the appearances of the query terms in each document, (b) transforming the document information into the time scale, and (c) analyzing the time scale with the help of the DWT.
Information retrieval systems based on the DWT have many advantages over classical information retrieval systems, but using fixed document segmentation has a significant disadvantage: it neglects the semantic structure of the document. The main research hypothesis of this chapter is the necessity of using spectral analysis based on different types of semantic segmentation, namely sentence-based segmentation, paragraph-based segmentation and fixed-length segmentation, to improve the document score.
This chapter proceeds as follows: Sect. 2 presents the related background, Sect. 3 introduces the most important issues in the design and implementation of information retrieval using the DWT method, Sect. 4 describes the experiments and results using the Text REtrieval Conference (TREC) collection, and finally the conclusion is given in Sect. 5.
2 Background
In this section, the major related concepts are reviewed.
2.1 Term Signal
The notion of a term signal was introduced by [14] before being reintroduced by [19] as a word signal. It is a term vector representation that records the frequency of a term's appearance in particular bins, or segments, within a document. The term signal shows how a query term is spread over a document. If a term appearing in a document can be thought of as a signal, then this signal can be transformed using a transformation algorithm such as the Fourier transform or the wavelet transform and compared with other signals. A term signal carries representative information about a term in the time domain, since it describes where the term occurs and its frequency in a certain region. Applying this approach, for each document in the dataset, a set of term signals is computed as a sequence of integers showing the term frequency in each region of the document. In document d, the representation of the signal of a query term t is:
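In the standard form used by [14, 19], this is the vector of per-segment frequencies:

\(\tilde{f}_{t,d} = [f_{t,1,d}, f_{t,2,d}, \ldots , f_{t,B,d}]\)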
where
\({f_{t,b,d}}\): the frequency of term t in segment b of document d, for \(1 \le b \le B\).
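As an illustrative sketch (not the authors' original implementation; the function name and tokenization are assumptions), a term signal for fixed-length segmentation can be built from a tokenized document as follows:

```python
def term_signal(term, tokens, num_bins):
    """Frequency of `term` in each of `num_bins` equal-width segments."""
    signal = [0] * num_bins
    if not tokens:
        return signal
    width = len(tokens) / float(num_bins)  # words per segment
    for pos, tok in enumerate(tokens):
        if tok == term:
            # Map the word position to its segment index.
            signal[min(int(pos / width), num_bins - 1)] += 1
    return signal

tokens = "the cat sat on the mat near the door".split()
print(term_signal("the", tokens, 4))  # -> [1, 1, 0, 1]
```

For sentence-based or paragraph-based segmentation (Sect. 2.3), only the mapping from word position to segment index changes.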
2.2 Weighting Scheme
Zobel and Moffat [20] provided a more general structure of the form AB-CDE-FGH, to allow for the many different scoring functions that had appeared. Their study compared 720 different weighting schemes based on the TREC document set. Their results showed that the BD-ACI-BCA method was the overall best performer. The computed segment components are weighted using a BD-ACI-BCA weighting scheme as follows:
The term signal then corresponds to the weighted term signal as follows:
2.3 Document Segmentation
Document segmentation is a very important issue in information retrieval, especially when using a transformation algorithm such as the DWT, because it determines the acceptable distance between two query terms in a document for them to be considered in the same region (bin, line, paragraph, etc.). Three segmentation schemes were suggested in [21]:
- Sentence-based segmentation.
- Paragraph-based segmentation.
- Fixed-length segmentation.
Before diving into the details of segmentation, one should understand that if the query terms are found in the same segment/bin, the wavelet transform algorithm should give a higher score than if they are scattered across the document; that is why segmentation is a very important issue. Sentence-based segmentation and paragraph-based segmentation were first investigated by [21] to give a real meaning to proximity. One limitation of the wavelet transform algorithm is that it requires the number of segments (bins may be used as a synonym) to be exactly \(2^n\), where \(n \in Z\). There is no need for zero padding in the third scheme, fixed-length segmentation, because the number of segments used is already of the form \(2^n\), where \(n \in Z\).
The number of segments when using fixed-length segmentation depends heavily on the nature of the dataset used. Documents of different lengths require different segment sizes. For instance, datasets such as chats, reviews or microblogs require a very small number of bins (4, 8 or 16) [22]. A dataset such as TREC requires a larger number of bins (8, 16 or 32), as used in [21] as well as in [23].
Another problem that depends on the nature of the dataset is the definition of a document: is the document the whole chat or a single message? In a dataset such as the Holy Quran, a document may be a verse, a chapter, a related topic (a sequence of verses) or a page [24].
The problem arises with sentence-based segmentation and paragraph-based segmentation because the number of segments is not necessarily of the form \(2^n\). In this case, zero padding is needed. Zero padding refers to adding zeros to the end of a term signal to increase its length; the length must be increased whenever the number of segments (based on sentences or paragraphs) falls between \(2^{n-1}\) and \(2^n\), where \(n \in Z\). Much of the inaccuracy in the results of the suggested information retrieval model stems from this zero padding of the term signal.
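Zero padding can be sketched with a small helper (illustrative, not from the original chapter):

```python
def zero_pad(signal):
    """Pad a term signal with zeros up to the next power-of-two length."""
    n = 1
    while n < len(signal):
        n *= 2
    return list(signal) + [0] * (n - len(signal))

print(zero_pad([2, 0, 1, 1, 0]))  # 5 segments are padded to 8
```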
The Haar and Daubechies wavelet transform algorithms behave differently with respect to zero padding:

- The Haar decomposition uses discontinuous basis functions that are not successful in smoothing zero padding.
- Daubechies supports orthogonal wavelets with a preassigned degree of smoothness.
2.4 Wavelet Transform Algorithm
The wavelet transform has been used in many domains such as image processing, edge detection, data compression, image compression, video compression, audio compression, compression of electroencephalography (EEG) signals, compression of electrocardiograph (ECG) signals, denoising and filtering, and finally in information retrieval and text mining. The wavelet transform has the ability to decompose natural signals into scaled and shifted versions of a single mother wavelet [25]. Wavelets are defined by the wavelet function \(\psi (t)\) and the scaling function \(\varphi (t)\) in the time domain. The wavelet function is described as \(\psi \in \mathbf {L}^2\mathbb {R}\) (where \(\mathbf {L}^2\mathbb {R}\) is the set of functions f(t) which satisfy \(\int \mid f(t)\mid ^2 dt < \infty \)) with a zero average and a norm of 1. A wavelet can be scaled and translated by adjusting the parameters s and u, respectively.
The scaling factor keeps the norm equal to one for all s and u. The wavelet transform of \(f\in \mathbf {L}^2\mathbb {R}\) at time u and scale s is:
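In the standard form consistent with these definitions (see [17]), this is:

\(Wf(u,s) = \int _{-\infty }^{+\infty } f(t) \frac{1}{\sqrt{s}} \psi ^*\left( \frac{t-u}{s}\right) dt\)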
where s and \(u \in \mathbf {Z}\) and \(\psi ^*\) is the complex conjugate of \(\psi \) [17].
The scaling function is described as \(\varphi _{u,s}(t) \in V_n\). The scaling function satisfies the following:
The filter bank tree-structured algorithm can be used to compute the DWT. The signal f(t) can be decomposed using the DWT filter bank tree:
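In the standard multiresolution form consistent with the notation below, the decomposition is:

\(f(t) = A^s(t) + \sum _{i=1}^{s} D^i(t)\)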
where \(A^s\) corresponds to the approximation sub-signal and \(D^s\) corresponds to a detail sub-signal at the sth level of the DWT of the signal f(t).
2.4.1 Haar Wavelet Transform
One of the simplest wavelet transforms is the Haar transform [26]. The Haar transform is derived from the Haar matrix. It provides different levels of resolution of a signal: at each resolution, the positions of the terms are shown by the transformed signal. The Haar transform takes in every possible scaled and shifted version of the wavelet. Given a term signal \(\tilde{f}_{t,d}\) and a number of segments N, the wavelet components will be \(H_N\tilde{f}_{t,d}\), where \(H_N\) is the Haar matrix. When \(N=4\) we have
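In one common orthonormal normalization (an assumption; unnormalized variants of \(H_4\) also appear in the literature), the \(4\times 4\) Haar matrix and its application to a term signal can be sketched as:

```python
import math

s = 1.0 / math.sqrt(2.0)
H4 = [
    [0.5, 0.5, 0.5, 0.5],    # overall average (whole document)
    [0.5, 0.5, -0.5, -0.5],  # first half vs. second half
    [s, -s, 0.0, 0.0],       # detail within the first half
    [0.0, 0.0, s, -s],       # detail within the second half
]

def haar4(signal):
    """Apply the 4x4 Haar matrix to a 4-bin term signal."""
    return [sum(h * x for h, x in zip(row, signal)) for row in H4]

print(haar4([1, 0, 2, 0]))
```

The first component summarizes the whole document, while the later components localize where in the document the term occurrences fall.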
2.4.2 Daubechies Wavelet Transform
In this research, we take advantage of the smoothing of the transformed term signal offered by the Daubechies wavelet transform. There are many types of Daubechies transforms; the simplest is the Daub4 wavelet. The Daub4 wavelet transform is defined in the same way as the Haar wavelet transform and differs only in how the scaling functions and wavelets are defined.
The following Python code computes the Daubechies wavelet transform (Daub4). The main function to call is daub4Transform, which takes one parameter, the term signal; the length of the term signal must be \(2^n\) with \(n \in Z\). The following is an example of calling the function:
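The original listing is not reproduced in this version; a minimal filter-bank sketch of daub4Transform, assuming the standard Daub4 filter coefficients and periodic (wrap-around) boundary handling, could look like:

```python
import math

# Standard Daub4 scaling (H) and wavelet (G) filter coefficients.
_S3 = math.sqrt(3.0)
H = [(1 + _S3) / (4 * math.sqrt(2.0)), (3 + _S3) / (4 * math.sqrt(2.0)),
     (3 - _S3) / (4 * math.sqrt(2.0)), (1 - _S3) / (4 * math.sqrt(2.0))]
G = [H[3], -H[2], H[1], -H[0]]

def _daub4_step(signal):
    """One level: split a signal into approximation and detail halves,
    wrapping around at the boundary (periodic extension)."""
    n = len(signal)
    approx = [sum(H[i] * signal[(2 * k + i) % n] for i in range(4))
              for k in range(n // 2)]
    detail = [sum(G[i] * signal[(2 * k + i) % n] for i in range(4))
              for k in range(n // 2)]
    return approx, detail

def daub4Transform(signal):
    """Full Daub4 DWT of a signal whose length is a power of two (>= 4)."""
    out = [float(x) for x in signal]
    n = len(out)
    while n >= 4:
        approx, detail = _daub4_step(out[:n])
        out[:n] = approx + detail
        n //= 2
    return out
```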
print(daub4Transform([1, 2, 3, 0, 0, 0, 0, 0]))
The time complexity of the matrix multiplication behind this transformation is \(O(N^2)\) for signals of N elements. To reduce this time, one may apply the wavelet scaling function and the wavelet function directly as a filter bank.
3 Design Issues and Implementation of Information Retrieval Using DWT
3.1 Problems and Design Issues
Park et al. [15,16,17] assume that documents have a fixed length and are unformatted. Using a fixed number of bins for every document in a dataset has several disadvantages:
1. It does not distinguish between long and short documents. Suppose a fixed number of bins B is used, with \(B = 2^i\) and \(i \in \mathbb {N}\). For any document d with |d| words, each bin b contains \(|b| = \frac{|d|}{B}\) words. This means different documents get different levels of proximity: the number of bins for microblogs [22] should be very small compared with the number of bins for scientific theses.
2. It does not consider the length of the query. If all terms of a query are found in one bin of a document, that document is more related to the query than a document holding the same query terms scattered over more than one bin. To account for this, the suggested bin length should satisfy \(|b| \ge |Q|\).
3. It does not consider the semantic representation of documents. A paragraph, which is a unit of thought focused on a single topic, may semantically represent a single bin; likewise, a sentence may semantically represent a single bin.
4. It does not consider the weight of matched query terms in a document according to the terms' locations. Matched query terms in the first part of a document should have a greater weight than matched query terms in the middle or at the end of the retrieved documents.
The main hypothesis of this research is that using different types of segmentation and term weighting alongside the DWT improves the document score. The algorithms and techniques are described in the following subsections.
3.2 Implementation of the Suggested Model
The suggested model was implemented using Python 2.7 and the Natural Language Toolkit (NLTK) 3.0, a leading platform for building Python programs that work with natural language text. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries.
3.3 Document Segmentation
A document may be segmented by fixed length, by sentences or by paragraphs. If fixed-length segmentation is used, the minimum distance between any occurrence of term \(t_i\) and term \(t_j\) in a document D should be considered. Suppose the document contains C words, d is the maximum distance, S is the number of sentences and P is the number of paragraphs. The main problem is to construct the Haar matrix in these different situations. The following algorithms show how to construct the Haar matrix.
3.4 Term Weighting
We hypothesize that significant terms usually appear at the front of a document. A formatted document may be segmented by document length, sentences, topics/subtopics or paragraphs, while unformatted documents may be segmented by document length, sentences or paragraphs. The minimum distance between any occurrence of term \(t_i\) and term \(t_j\) in a document D should be considered.
4 Experiments and Results
The TREC dataset is used to evaluate the proposed methods. Specifically, the Associated Press disk 2 and Wall Street Journal disk 2 (AP2WSJ2) collections were chosen. The AP2WSJ2 set contains more than 150,000 documents. The selected query set was queries 51 to 200 (TREC 1, 2 and 3). The investigated methods are compared with previous high-precision methods. The experiments are performed with the following segmentation methods:
- Fixed number of segments (8 bins).
- Sentence-based segmentation.
- Paragraph-based segmentation.
- Fixed number of terms in a segment (\(d = 3\)).
- Fixed number of terms in a segment (\(d = 10\)).
- Paragraph-based segmentation with position weight.
These segmentation methods are compared against two other important term proximity measures, namely the shortest-substring retrieval (SSR) method [6] and the window-based bi-gram BM25 model [27].
In the experiments, the official TREC evaluation measures are used, namely the Mean Average Precision (MAP) and the precision at rank K (where \(K = \{5,10,15,20\}\)).
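These measures can be computed as follows (an illustrative sketch, not TREC's official trec_eval implementation):

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / float(k)

def average_precision(ranked, relevant):
    """Mean of precision values taken at the rank of each relevant document."""
    hits, total = 0, 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / float(rank)
    return total / len(relevant) if relevant else 0.0

ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
print(precision_at_k(ranked, relevant, 2))
print(average_precision(ranked, relevant))
```

MAP is then the mean of average_precision over all queries in the query set.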
Table 1 shows that most of the suggested segmentation methods do not yield significant improvements. All suggested segmentation methods outperform the SSR method. However, they do not outperform the fixed-bin segmentation method (8 bins) or the window-based bi-gram BM25 model, while the fixed-bin segmentation method outperforms the window-based bi-gram BM25 model.
Because the Haar wavelet transform entails zero padding of the term signal, which may affect the accuracy of the results, the Daubechies transform is also used to improve the retrieval accuracy.
In the paragraph-based segmentation with position weight method, two different additional weight schemes are used for the first three paragraphs: (0.75, 0.50, 0.25) and (0.2, 0.1, 0.05).
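One way such a scheme could be applied, sketched under the assumption that the boost is added to the weight of terms matched in the first three paragraphs (the function name is hypothetical):

```python
def position_boost(paragraph_index, scheme=(0.2, 0.1, 0.05)):
    """Additional weight for a match in the given paragraph (0-based);
    paragraphs beyond the scheme receive no boost."""
    if paragraph_index < len(scheme):
        return scheme[paragraph_index]
    return 0.0

print([position_boost(i) for i in range(5)])
```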
Table 2 shows the results of the same suggested segmentation methods when using the Daubechies (Daub4) wavelet transform. The Daubechies wavelet transform outperforms the Haar wavelet transform in most cases and measures. Although the values are hardly distinguishable from those of the Haar wavelet transform in Table 1, there is a slight improvement in the results.
There is also a slight improvement in the results when using more smoothing: for the additional weight schemes applied to sentence and paragraph positions, the scheme (0.2, 0.1, 0.05) gives better results than (0.75, 0.50, 0.25). This confirms previous findings in the literature [28, 29].
5 Conclusion
This chapter has given an account of the use of the DWT in information retrieval and the reasons for using different document segmentations, and has proposed a variety of segmentation methods to enhance the document score and the retrieval accuracy. All suggested segmentation methods are based on meaning and context. All suggested segmentation methods outperform the SSR method, while the sentence-based, paragraph-based and sentence-based with position weight methods outperform the window-based bi-gram BM25 model. However, they do not outperform the fixed-bin segmentation method (8 bins).
This research has also discussed the use of different wavelet transform algorithms, namely Haar and Daubechies, and shown how Daubechies slightly outperforms Haar. One of the most important directions for future work is to find the optimum number of bins and to refine the additional weight scheme for a given dataset.
References
Salton, G., Fox, E.A., Wu, H.: Extended Boolean information retrieval. Commun. ACM 26, 1022–1036 (1983)
Salton, G., Wong, A., Yang, C.-S.: A vector space model for automatic indexing. Commun. ACM 18, 613–620 (1975)
Kang, B., Kim, D., Kim, H.: Fuzzy information retrieval indexed by concept identification. In: International Conference on Text, Speech and Dialogue, pp. 179–186 (2005)
Wong, S.K.M., Ziarko, W., Wong, P.C.N.: Generalized vector spaces model in information retrieval. In: Proceedings of the 8th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 18–25 (1985)
Cummins, R., O’Riordan, C.: Learning in a pairwise term-term proximity framework for information retrieval. In: Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 251–258 (2009)
Clarke, C.L.A., Cormack, G.V.: Shortest-substring retrieval and ranking. ACM Trans. Inf. Syst. (TOIS) 18, 44–78 (2000)
Hawking, D., Thistlewaite, P.: Relevance Weighting using Distance Between Term Occurrences (1996)
Bhatia, M.P.S., Khalid, A.K.: Contextual proximity based term-weighting for improved web information retrieval. In: International Conference on Knowledge Science, Engineering and Management, pp. 267–278 (2007)
Aref, W.G., Barbara, D., Johnson, S., Mehrotra, S.: Efficient processing of proximity queries for large databases. In: Proceedings of the Eleventh International Conference on Data Engineering, 1995, pp. 147–154 (1995)
El Mahdaouy, A., Gaussier, E., El Alaoui, S.O.: Exploring term proximity statistic for Arabic information retrieval. In: 2014 Third IEEE International Colloquium in Information Science and Technology (CIST). IEEE (2014)
Ye, Z., He, B., Wang, L., Luo, T.: Utilizing term proximity for blog post retrieval. J. Am. Soc. Inf. Sci. Technol. 64, 2278–2298 (2013)
Costa, A., Melucci, M.: An information retrieval model based on discrete fourier transform. In: Information Retrieval Facility Conference, pp. 84–99 (2010)
Ramamohanarao, K., Park, L.A.F.: Spectral-based document retrieval. In: Advances in Computer Science-ASIAN 2004. Higher-Level Decision Making, pp. 407–417. Springer (2004)
Park, L.A.F., Ramamohanarao, K., Palaniswami, M.: Fourier domain scoring: a novel document ranking method. IEEE Trans. Knowl. Data Eng. 16, 529–539 (2004)
Park, L.A.F., Palaniswami, M., Kotagiri, R.: Internet document filtering using fourier domain scoring. In: European Conference on Principles of Data Mining and Knowledge Discovery, pp. 362–373 (2001)
Park, L.A., Palaniswami, M., Ramamohanarao, K.: A novel document ranking method using the discrete cosine transform. IEEE Trans. Pattern Anal. Mach. Intell. 27, 130–135 (2005)
Park, L.A.F., Ramamohanarao, K., Palaniswami, M.: A novel document retrieval method using the discrete wavelet transform. ACM Trans. Inf. Syst. (TOIS) 23, 267–298 (2005)
Arru, G., Feltoni Gurini, D., Gasparetti, F., Micarelli, A., Sansonetti, G.: Signal-based user recommendation on twitter. In: Proceedings of the 22nd International Conference on World Wide Web Steering Committee/ACM, pp. 941–944 (2013)
Yang, T., Lee, D.: T3: On mapping text to time series. In: Proceedings of the 3rd Alberto Mendelzon Int’l Workshop on Foundations of Data Management, Arequipa, Peru, May 2009
Zobel, J., Moffat, A.: Exploring the similarity space. In: ACM SIGIR Forum, pp. 18–34 (1998)
Dahab, M., Kamel, M., Alnofaie, S.: Further Investigations for Documents Information Retrieval Based on DWT. In: International Conference on Advanced Intelligent Systems and Informatics, pp. 3–11. Springer International Publishing (2016)
Diwali, A., Kamel, M., Dahab, M.: Arabic text-based chat topic classification using discrete wavelet transform. Int. J. Comput. Sci. Issues (IJCSI) 12, 86 (2015)
Alnofaie, S., Dahab, M., Kamal, M.: A novel information retrieval approach using query expansion and spectral-based. Int. J. Adv. Comput. Sci. Appl. (IJACSA) 7(9), 364–373 (2016)
Aljaloud, H., Dahab, M., Kamal, M.: Stemmer impact on Quranic mobile information retrieval performance. Int. J. Adv. Comput. Sci. Appl. (IJACSA) 7(12), 135–139 (2016)
Daubechies, I.: Where do wavelets come from? A personal point of view. Proc. IEEE 84, 510–513 (1996)
Haar, A.: Zur theorie der orthogonalen funktionen systeme. Math. Ann. 69, 331–371 (1910)
He, B., Huang, J.X., Zhou, X.: Modeling term proximity for probabilistic information retrieval models. Inf. Sci. 181(14), 3017–3031 (2011)
Hassan, H., Dahab, M., Bahnassy, K., Idrees, A., Gamal, F.: Query answering approach based on document summarization. Int. J. Mod. Eng. Res. (IJMER) 4(12), 50–55 (2014)
Hassan, H., Dahab, M., Bahnassy, K., Idrees, A., Gamal, F.: Arabic documents classification method: a step towards efficient documents summarization. Int. J. Recent Innov. Trends Comput. Commun. 3(1), 351–359 (2015)
© 2018 Springer International Publishing AG
Dahab, M.Y., Kamel, M., Alnofaie, S. (2018). An Empirical Study of Documents Information Retrieval Using DWT. In: Shaalan, K., Hassanien, A., Tolba, F. (eds) Intelligent Natural Language Processing: Trends and Applications. Studies in Computational Intelligence, vol 740. Springer, Cham. https://doi.org/10.1007/978-3-319-67056-0_13
Print ISBN: 978-3-319-67055-3
Online ISBN: 978-3-319-67056-0