Abstract
The steady growth of textual document collections is a key driver of progress in modern information retrieval, whose techniques are constantly challenged in both effectiveness and efficiency. For a given user query, the number of retrieved documents can be overwhelmingly large, which hampers their efficient use. In addition, retaining only relevant documents in a query answer is of paramount importance for meeting user needs effectively. In this situation, query expansion offers an interesting solution for obtaining a complete answer while preserving the quality of the retained documents. Its success relies mainly on an accurate choice of the terms added to the initial query. Interestingly, query expansion takes advantage of large text volumes by extracting statistical information about co-occurrences of index terms and using it to make user queries better fit the real information needs. In this respect, a promising track consists in applying data mining methods to the extraction of dependencies between terms. In this paper, we present a novel approach, based on association rules, for mining knowledge that supports query expansion. The key feature of our approach is a better trade-off between the size of the mining result and the conveyed knowledge. To this end, our association rule mining method draws on results from Galois connection theory and on compact representations of rule sets in order to reduce the huge number of potentially useful associations. An experimental study examined the application of our approach to real document collections, on which automatic query expansion was performed. The results show a significant improvement in the performance of the information retrieval system, in terms of both recall and precision, as confirmed by significance testing with the Wilcoxon test.
Notes
By analogy to the itemset terminology used in data mining for a set of items.
The AMARYLLIS project was initiated by INIST-CNRS and is co-funded by AUPELF-UREF. Its goal is to evaluate French text retrieval systems.
The Cross-Language Evaluation Forum (CLEF) promotes multilingual information access. It offers benchmark collection data for evaluating IR systems. The associated website is: http://www.clef-campaign.org/.
In this paper, we denote by |X| the cardinality of the set X.
In the remainder, \(T_1 \stackrel{c}{\Rightarrow} T_2\) indicates that the rule \(T_1 \Rightarrow T_2\) has a confidence value equal to \(c\).
The rule \(T \Rightarrow \emptyset\) is usually considered uninformative.
Available at: http://www.adrem.ua.ac.be/~goethals/software/.
Distributed by the Synapse Development Corporation.
maxsupp is a user-defined upper bound on support: a termset must not occur more often than this threshold.
Test datasets are available at: http://fimi.cs.helsinki.fi/data.
Freely available at: http://sourceforge.net/projects/lemur/, while the information about this toolkit is available at: http://www.lemurproject.org/.
idf stands for inverse document frequency.
By baseline, we refer to the baselines using tf×idf, BM25tf and OKAPI BM25 weighting schemes.
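The notation used in these notes (termsets, the cardinality |X|, and a rule holding with confidence c) can be illustrated with a minimal sketch. It assumes the standard data mining definitions, where the support of a termset is the number of documents containing all its terms and the confidence of a rule is the support of premise and conclusion together divided by the support of the premise. The function names `support`, `confidence` and `expand_query`, the toy documents and the `min_conf` threshold are hypothetical and chosen for illustration only; this is not the mining method presented in the paper, which relies on compact rule bases.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

Document = FrozenSet[str]
Rule = Tuple[Set[str], Set[str]]  # (premise termset, conclusion termset)


def support(termset: Iterable[str], docs: List[Document]) -> int:
    """Count the documents that contain every term of the termset."""
    terms = frozenset(termset)
    return sum(1 for doc in docs if terms <= doc)


def confidence(premise: Iterable[str], conclusion: Iterable[str],
               docs: List[Document]) -> float:
    """Confidence of premise => conclusion under the standard definition."""
    supp_premise = support(premise, docs)
    if supp_premise == 0:
        raise ValueError("the premise does not occur in any document")
    both = frozenset(premise) | frozenset(conclusion)
    return support(both, docs) / supp_premise


def expand_query(query: Iterable[str], rules: List[Rule],
                 docs: List[Document], min_conf: float = 0.6) -> Set[str]:
    """Hypothetical expansion step: add the conclusion of each rule whose
    premise is contained in the query and whose confidence reaches min_conf."""
    terms = set(query)
    expanded = set(terms)
    for premise, conclusion in rules:
        if set(premise) <= terms and confidence(premise, conclusion, docs) >= min_conf:
            expanded |= set(conclusion)
    return expanded


# Toy collection (assumed data, not taken from the paper's test collections).
docs = [
    frozenset({"query", "expansion", "retrieval"}),
    frozenset({"query", "expansion"}),
    frozenset({"query", "retrieval"}),
    frozenset({"mining", "rules"}),
]
rules = [({"query"}, {"expansion"}), ({"query"}, {"mining"})]
expanded = expand_query({"query"}, rules, docs)
```

In this toy collection, the rule from "query" to "expansion" holds in two of the three documents containing "query", so it passes the assumed threshold of 0.6, whereas the rule from "query" to "mining" never holds and contributes nothing to the expanded query.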
Acknowledgements
We would like to thank the anonymous reviewers for their helpful comments and suggestions. We are also grateful to the Evaluations and Language resources Distribution Agency (ELDA), which kindly provided us with the Le Monde 94 and ATS 94 document collections of the CLEF 2003 campaign.
Cite this article
Latiri, C., Haddad, H. & Hamrouni, T. Towards an effective automatic query expansion process using an association rule mining approach. J Intell Inf Syst 39, 209–247 (2012). https://doi.org/10.1007/s10844-011-0189-9