Abstract
In vague queries, a user enters a value that represents some real-world object and expects as the result the set of database values that represent this object, even when they do not match the entered value exactly. The problem arises in databases that collect data from different sources and in databases where different users enter data directly. Query engines usually rely on some type of similarity metric to support inexact matching. The problem of building query engines that execute vague queries has already been studied, but an important problem remains open: defining the threshold to be used when a similarity scan is performed over a database column. It is known from the literature that the threshold depends on the similarity metric and also on the set of values being queried. Thus, it is unrealistic to expect the user to supply a threshold at query time. In this paper we propose a process for estimating recall and precision values at several thresholds for a database column. The idea is that a database administrator starts this process in a pre-processing phase, using samples extracted from the database. The metadata collected by this process may then be used during the optimization phase of query processing. The paper describes this process as well as the experiments performed to evaluate it.
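To make the idea of a per-threshold recall/precision table concrete, the following is a minimal sketch and not the authors' procedure. It assumes a small sample in which every value already carries a ground-truth object label (the hypothetical `object_id`), and it uses Python's `difflib.SequenceMatcher` as a stand-in similarity metric. The function names `similarity`, `recall_precision`, and `estimate_table` are illustrative only.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity metric in [0, 1]; an assumption, not the paper's metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def recall_precision(sample, threshold):
    """Average recall and precision of a similarity scan over a labeled sample.

    `sample` is a list of (value, object_id) pairs; object_id is a
    hypothetical ground-truth label for the real-world object.
    Each sample value is used once as the query value.
    """
    recalls, precisions = [], []
    for query, query_obj in sample:
        # Similarity scan: keep every value whose score reaches the threshold.
        retrieved = [obj for value, obj in sample
                     if similarity(query, value) >= threshold]
        relevant = sum(1 for _, obj in sample if obj == query_obj)
        hits = sum(1 for obj in retrieved if obj == query_obj)
        recalls.append(hits / relevant)
        # Guard against an empty result (only possible for thresholds above 1).
        precisions.append(hits / len(retrieved) if retrieved else 1.0)
    n = len(sample)
    return sum(recalls) / n, sum(precisions) / n


def estimate_table(sample, thresholds):
    """Map each threshold to its (recall, precision) pair on the sample."""
    return {t: recall_precision(sample, t) for t in thresholds}


if __name__ == "__main__":
    sample = [("John Smith", 1), ("Jon Smith", 1), ("Mary Jones", 2)]
    for t, (r, p) in estimate_table(sample, [0.0, 0.5, 0.8, 1.0]).items():
        print(f"threshold={t:.1f} recall={r:.2f} precision={p:.2f}")
```

A table of this kind is the sort of metadata an optimizer could consult to pick a threshold for a column. In practice labeled samples are rarely available, so a real process would need some other way to approximate the ground truth.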
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Stasiu, R.K., Heuser, C.A., da Silva, R. (2005). Estimating Recall and Precision for Vague Queries in Databases. In: Pastor, O., Falcão e Cunha, J. (eds) Advanced Information Systems Engineering. CAiSE 2005. Lecture Notes in Computer Science, vol 3520. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11431855_14
DOI: https://doi.org/10.1007/11431855_14
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-26095-0
Online ISBN: 978-3-540-32127-9
eBook Packages: Computer Science (R0)