Abstract
When gathering information from multiple independent data sources, users generally pose a sequence of queries to each source and then combine (union) or cross-reference (join) the results to obtain the information they need. The process also involves a fair amount of trial and error, in which queries are recursively refined according to the results of a previous query in the sequence. To an outside observer, the aim of such a sequence of queries may not be immediately obvious.
We investigate the problem of isolating and characterizing subsequences that represent coherent information retrieval goals within a sequence of queries sent by a user to different data sources over a period of time. The problem has two sub-problems: segmenting the sequence into subsequences, each representing a discrete goal; and labeling each query in these subsequences according to how it contributes to the goal. We propose a method in which a discriminative probabilistic model (a Conditional Random Field) is trained on pre-labeled sequences. We have tested the accuracy with which such a model can infer labels and segmentation on novel sequences. Results show that the approach is very accurate (> 95% accuracy) when there are no spurious queries in the sequence and moderately accurate even in the presence of substantial noise (~70% accuracy when 15% of queries in the sequence are spurious).
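To illustrate how segmentation and labeling can be treated as a single sequence-labeling task, the sketch below decodes the best label sequence of a linear-chain model with the Viterbi algorithm and then reads goal segments off the labels. It is not the authors' model: the label set (`B` for a query that opens a goal, `I` for one that continues it, `O` for a spurious query), the per-query scores, and the transition weights are all hypothetical and hand-set here, whereas a trained Conditional Random Field would learn such weights from pre-labeled sequences.

```python
from typing import Dict, List, Tuple

# Hypothetical label set (an assumption, not taken from the paper):
# "B" opens a new goal, "I" continues the current goal, "O" is spurious.
LABELS = ["B", "I", "O"]


def viterbi(emissions: List[Dict[str, float]],
            transitions: Dict[Tuple[str, str], float]) -> List[str]:
    """Return the highest-scoring label sequence of a linear-chain model."""
    if not emissions:
        return []
    # best[t][y]: score of the best path that ends with label y at position t
    best = [{y: emissions[0][y] for y in LABELS}]
    back = []  # back[t-1][y]: best predecessor of label y at position t
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for y in LABELS:
            prev = max(LABELS, key=lambda p: best[-1][p] + transitions[(p, y)])
            scores[y] = best[-1][prev] + transitions[(prev, y)] + emissions[t][y]
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(LABELS, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def segments(labels: List[str]) -> List[List[int]]:
    """Group query indices into goal segments; "O" queries are left out."""
    segs: List[List[int]] = []
    for i, y in enumerate(labels):
        if y == "B" or (y == "I" and not segs):
            segs.append([i])
        elif y == "I":
            segs[-1].append(i)
    return segs


# Hand-set, illustrative weights: all transitions are neutral except that
# two goal openings in a row are penalised.
transitions = {(p, y): 0.0 for p in LABELS for y in LABELS}
transitions[("B", "B")] = -2.0

# One score dictionary per query in a five-query sequence (made-up values).
emissions = [
    {"B": 2.0, "I": 0.0, "O": 0.0},
    {"B": 0.0, "I": 2.0, "O": 0.0},
    {"B": 0.0, "I": 0.0, "O": 2.0},
    {"B": 1.5, "I": 1.0, "O": 0.0},
    {"B": 0.0, "I": 2.0, "O": 0.0},
]

labels = viterbi(emissions, transitions)
goal_segments = segments(labels)
print(labels)         # ['B', 'I', 'O', 'B', 'I']
print(goal_segments)  # [[0, 1], [3, 4]]
```

In this toy run the third query is labeled spurious and the remaining queries fall into two goals. Encoding segment boundaries in the labels themselves is one common way to obtain segmentation and labeling from a single decoding pass; whether the paper uses this particular encoding is not stated in the abstract.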
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Acar, A.C., Motro, A. (2011). Segmenting and Labeling Query Sequences in a Multidatabase Environment. In: Meersman, R., et al. On the Move to Meaningful Internet Systems: OTM 2011. OTM 2011. Lecture Notes in Computer Science, vol 7044. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25109-2_24
DOI: https://doi.org/10.1007/978-3-642-25109-2_24
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25108-5
Online ISBN: 978-3-642-25109-2
eBook Packages: Computer Science (R0)