Segmenting and Labeling Query Sequences in a Multidatabase Environment

  • Conference paper
On the Move to Meaningful Internet Systems: OTM 2011 (OTM 2011)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7044)

Abstract

When gathering information from multiple independent data sources, users generally pose a sequence of queries to each source and then combine (union) or cross-reference (join) the results to obtain the information they need. Gathering information also involves a fair amount of trial and error, in which queries are recursively refined according to the results of previous queries in the sequence. To an outside observer, the aim of such a sequence of queries may not be immediately obvious.

We investigate the problem of isolating and characterizing subsequences that represent coherent information retrieval goals within a sequence of queries sent by a user to different data sources over a period of time. The problem has two sub-problems: segmenting the sequence into subsequences, each representing a discrete goal; and labeling each query in these subsequences according to how it contributes to the goal. We propose a method in which a discriminative probabilistic model (a Conditional Random Field) is trained on pre-labeled sequences. We tested the accuracy with which such a model infers labels and segmentation on novel sequences. Results show that the approach is very accurate (> 95% accuracy) when the sequence contains no spurious queries, and moderately accurate even in the presence of substantial noise (~70% accuracy when 15% of the queries in the sequence are spurious).
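
To illustrate how the two sub-problems relate, the following minimal sketch shows how a per-query labeling can induce a segmentation. It is not the paper's method: the tag scheme (`B-` opens a goal, `I-` continues it, `O` marks a spurious query), the role names, and the function name `segments_from_labels` are all assumptions made for illustration. In practice the labels would come from a trained sequence model such as a Conditional Random Field; here they are written by hand.

```python
from typing import List, Tuple


def segments_from_labels(labels: List[str]) -> List[Tuple[int, int]]:
    """Group query positions into goal segments from per-query tags.

    Hypothetical tag scheme (an assumption, not taken from the paper):
      "B-<role>" opens a new goal, "I-<role>" continues the current goal,
      "O" marks a spurious query that belongs to no goal.
    Returns half-open (start, end) index ranges, one per segment.
    """
    segments: List[Tuple[int, int]] = []
    start = None  # index where the currently open segment began
    for i, tag in enumerate(labels):
        if tag.startswith("B-"):
            # A new goal begins; close any segment that is still open.
            if start is not None:
                segments.append((start, i))
            start = i
        elif tag == "O":
            # Simplification: a spurious query ends the open segment.
            if start is not None:
                segments.append((start, i))
            start = None
        elif tag.startswith("I-") and start is None:
            # Tolerate a segment that lacks an explicit opener.
            start = i
    if start is not None:
        segments.append((start, len(labels)))
    return segments


# Hand-written labels for six queries; the third is spurious.
example = ["B-join", "I-join", "O", "B-union", "I-union", "I-union"]
print(segments_from_labels(example))
```

Treating a spurious query as a segment boundary is a simplification chosen to keep the sketch short; a model that allows noise to be interleaved within a goal would need a richer label scheme.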

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Acar, A.C., Motro, A. (2011). Segmenting and Labeling Query Sequences in a Multidatabase Environment. In: Meersman, R., et al. On the Move to Meaningful Internet Systems: OTM 2011. OTM 2011. Lecture Notes in Computer Science, vol 7044. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25109-2_24

  • DOI: https://doi.org/10.1007/978-3-642-25109-2_24

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-25108-5

  • Online ISBN: 978-3-642-25109-2

  • eBook Packages: Computer Science, Computer Science (R0)
