Emergent XML Mining: Discovering an Efficient Mapping from XML Instances to Relational Schemas

  • Hiroshi Ishikawa
Part of the Advanced Information and Knowledge Processing book series (AI&KP)

As a technology related to emergent XML mining, we propose an adaptable approach to discovering database schemas for well-formed XML data such as EDI, news, and digital libraries, which we interchange, filter, or download for future retrieval and analysis. The generated schemas usually consist of more than one table. Our approach uses statistics of the XML data to control the number of tables into which the data are divided, so that the total cost of processing queries is reduced. To reduce the number of tables, we generate schemas suited to complex data such as text formatting tags and child elements with a small maximum number of occurrences. To this end, we introduce three functions, NULL expectation, Large Leaf Fields, and Large Child Fields, for controlling the number of tables. We describe how to translate queries in XQuery into queries in SQL. We also describe the concept of short paths contained in generated database schemas and their effects on the performance of query...
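The abstract does not define NULL expectation, Large Leaf Fields, or Large Child Fields, so the following is only a minimal sketch of the general idea of using instance statistics to decide how many tables to create. It assumes a hypothetical rule: a child element is stored as columns of its parent's table when it repeats rarely and is seldom absent, and otherwise gets its own table. The function names, the two thresholds, and the sample data are all illustrative assumptions, not the chapter's method.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def child_statistics(xml_text, parent_tag):
    """For each child tag found under `parent_tag`, return a pair:
    (maximum occurrences within one parent,
     fraction of parents in which the child is absent)."""
    root = ET.fromstring(xml_text)
    parents = list(root.iter(parent_tag))
    # child tag -> list of occurrence counts, one per parent that contains it
    counts = defaultdict(list)
    for parent in parents:
        per_parent = defaultdict(int)
        for child in parent:
            per_parent[child.tag] += 1
        for tag, n in per_parent.items():
            counts[tag].append(n)
    stats = {}
    for tag, occurrences in counts.items():
        absent = len(parents) - len(occurrences)
        stats[tag] = (max(occurrences), absent / len(parents))
    return stats


def plan_columns(stats, max_inline_occurrences=2, max_null_fraction=0.5):
    """Hypothetical decision rule (an assumption, not the chapter's functions):
    inline a child when it repeats at most `max_inline_occurrences` times and
    would be NULL in at most `max_null_fraction` of the parent rows."""
    plan = {}
    for tag, (max_occ, null_fraction) in stats.items():
        inline = (max_occ <= max_inline_occurrences
                  and null_fraction <= max_null_fraction)
        plan[tag] = "inline" if inline else "separate table"
    return plan


# Made-up sample instances, loosely shaped like bibliographic records.
SAMPLE = """<dblp>
  <article><title>A</title><author>X</author><author>Y</author><year>2001</year></article>
  <article><title>B</title><author>X</author><author>Y</author><author>Z</author></article>
  <article><title>C</title><author>W</author><note>n</note></article>
  <article><title>D</title><author>V</author><year>2003</year></article>
</dblp>"""

stats = child_statistics(SAMPLE, "article")
plan = plan_columns(stats)
for tag in sorted(plan):
    print(tag, stats[tag], plan[tag])
```

With these thresholds, rarely repeated and usually present children stay in the parent table, while frequently repeated or mostly missing children are split off. Raising either threshold yields fewer tables at the cost of more NULL or repeated columns, which is the kind of trade-off the abstract says is tuned to reduce total query cost.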


Keywords: Short Path, Query Processing, Database Schema, Document Type Definition, Child Element



This work is partially supported by the Ministry of Education, Culture, Sports, Science and Technology, Japan under Grants-in-Aid for Scientific Research (16 300 030, 19 300 026). We thank Mr. Takeyoshi Maku for his great efforts in implementing and evaluating the general ideas described in this chapter.



Copyright information

© Springer-Verlag London 2010

Authors and Affiliations

  1. Faculty of Informatics, Shizuoka University, Hamamatsu, Japan
