Abstract
The popularization of the Web has made a huge volume of data available to a large audience. On many Web sites, such as bookstores, electronic catalogs, and travel agencies, the pages are documents composed of pieces of data whose overall structure can be easily recognized. Such pages are called data-rich and can be seen as collections of complex objects. In this paper, we show how such objects can be represented by nested tables, which are simple, intuitive, and quite convenient for expressing their implicit structure. Our assumption is that, for most sites of interest, only a few examples are required to reveal the structure of the objects. To corroborate this assumption, we describe a data extraction tool that adopts this approach and present the results of experiments carried out with it.
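As a rough illustration of the nested-table idea, the sketch below models a bookstore page as a table whose rows each contain an inner table. The field names (`author`, `books`, `title`, `price`), the sample values, and the helper `unnest` are hypothetical and are not taken from the paper or from its extraction tool; the example only shows how one level of nesting can express the implicit structure of a complex object.

```python
# Hypothetical nested table: one outer row per author, each holding an
# inner table of books. Names and values are illustrative assumptions.
authors = [
    {
        "author": "Author A",
        "books": [
            {"title": "Title 1", "price": 10.0},
            {"title": "Title 2", "price": 12.5},
        ],
    },
    {
        "author": "Author B",
        "books": [{"title": "Title 3", "price": 8.0}],
    },
]


def unnest(rows, nested_field):
    """Flatten one level of nesting: emit one flat row per inner row."""
    flat = []
    for row in rows:
        # Copy the atomic (non-nested) attributes of the outer row.
        outer = {k: v for k, v in row.items() if k != nested_field}
        for inner in row[nested_field]:
            # Combine the outer attributes with each inner row.
            flat.append({**outer, **inner})
    return flat


flat_rows = unnest(authors, "books")
for r in flat_rows:
    print(r)
```

The nested form stores each author once, while the flattened form repeats the author on every book row, which is one reason a nested table can be the more natural fit for objects found on data-rich pages.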
This work is supported by Project SIAM (grant MCT/FINEP/PRONEX 76.97.1016.00) and by individual research grants from CNPq and CAPES.
Copyright information
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Laender, A.H.F., Ribeiro-Neto, B., da Silva, A.S., Silva, E.S. (2000). Representing Web Data as Complex Objects. In: Bauknecht, K., Madria, S.K., Pernul, G. (eds) Electronic Commerce and Web Technologies. EC-Web 2000. Lecture Notes in Computer Science, vol 1875. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44463-7_19
DOI: https://doi.org/10.1007/3-540-44463-7_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67981-3
Online ISBN: 978-3-540-44463-3
eBook Packages: Springer Book Archive