SPHINX: Schema integration by example

Journal of Intelligent Information Systems

Abstract

The Internet has instigated a critical need for automated tools that facilitate integrating countless databases. Since nontechnical end users are often the ultimate repositories of the domain information required to distinguish differences in data types, an effective solution must integrate simple GUI-based data-browsing tools and automatic mapping methods that eliminate the requirement for a technical user to supervise the process. We develop a metamodel of data integration as the basis for absorbing feedback from an end user. The schema integration algorithm draws examples from the data and learns integrating view definitions by asking a user simple yes-or-no questions. The metamodel enables a search mechanism that is guaranteed to converge to a correct integrating view definition without the user having to know a view definition language such as SQL or SchemaSQL, or even having to inspect the final view definition. We show how data catalog statistics, normally used to optimize queries, can be exploited to parameterize the search heuristics and improve the convergence of the learning algorithm.
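As a rough illustration of learning a view definition from yes-or-no answers, the following Python sketch eliminates candidate column mappings one question at a time. It is not the SPHINX algorithm or its metamodel: the hypothesis space (one-to-one column assignments), the function names `candidate_views` and `learn_view`, the sample rows, and the simulated user are all assumptions made for this example. The question-selection rule simply prefers the question that splits the remaining candidates most evenly; it does not use catalog statistics.

```python
from itertools import permutations


def candidate_views(source_cols, target_cols):
    """Enumerate every one-to-one assignment of source columns to target columns."""
    return [dict(zip(target_cols, perm))
            for perm in permutations(source_cols, len(target_cols))]


def learn_view(rows, source_cols, target_cols, oracle):
    """Narrow the candidate views by asking yes/no questions about example values.

    oracle(row, target_col, value) answers: "for this row, should the
    integrated view show `value` in `target_col`?"
    Returns the surviving candidates and the number of questions asked.
    """
    candidates = candidate_views(source_cols, target_cols)
    questions = 0
    while True:
        best = None  # (imbalance, row, target_col, value)
        for row in rows:
            for target in target_cols:
                # Group candidates by the value they would display here.
                groups = {}
                for view in candidates:
                    groups.setdefault(row[view[target]], []).append(view)
                if len(groups) < 2:
                    continue  # every candidate agrees, so asking gains nothing
                for value, members in groups.items():
                    imbalance = abs(len(candidates) - 2 * len(members))
                    if best is None or imbalance < best[0]:
                        best = (imbalance, row, target, value)
        if best is None:
            # Remaining candidates cannot be told apart on the available data.
            return candidates, questions
        _, row, target, value = best
        questions += 1
        answer = oracle(row, target, value)
        # Keep only the candidates consistent with the user's answer.
        candidates = [v for v in candidates
                      if (row[v[target]] == value) == answer]


# Hypothetical sample data and a simulated user who knows the intended view.
rows = [
    {"a": "Ada", "b": "Lovelace", "c": "London"},
    {"a": "Alan", "b": "Turing", "c": "London"},
    {"a": "Grace", "b": "Hopper", "c": "Arlington"},
]
source_cols = ["a", "b", "c"]
target_cols = ["surname", "city"]
true_view = {"surname": "b", "city": "c"}


def simulated_user(row, target, value):
    return row[true_view[target]] == value


result, asked = learn_view(rows, source_cols, target_cols, simulated_user)
print(result, asked)
```

Each question in this sketch removes at least one candidate and never removes a candidate consistent with the answers, so the loop terminates. It ends on the intended view only under the assumptions that the intended view is among the enumerated candidates, that the user answers consistently, and that the sample rows distinguish every pair of candidates.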

Author information

Corresponding author

Correspondence to Francois Barbançon.

About this article

Cite this article

Barbançon, F., Miranker, D.P. SPHINX: Schema integration by example. J Intell Inf Syst 29, 145–184 (2007). https://doi.org/10.1007/s10844-006-0011-2

  • DOI: https://doi.org/10.1007/s10844-006-0011-2