Learning Concise Relax NG Schemas Supporting Interleaving from XML Documents

Li, Yeting; Mou, Xiaoying; Chen, Haiming

doi:10.1007/978-3-030-05090-0_26


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11323)

Included in the following conference series:

International Conference on Advanced Data Mining and Applications

Abstract

Relax NG is a popular and powerful schema language for XML that governs the relative order of elements. Since many XML documents in practice either have no schema or have no valid one, we focus on inferring a concise Relax NG schema from a set of XML documents. The fundamental task in Relax NG inference is learning regular expressions. Previous work in this direction does not support all operators allowed in Relax NG, in particular interleaving. In this paper, based on an analysis of large-scale real-world Relax NG schemas, we propose a restricted subclass of regular expressions called chain regular expressions with interleaving (ICREs). We also develop a learning algorithm that infers a descriptive generalized ICRE from XML samples, based on single occurrence automata and the maximum clique. We conduct experiments on a real-world benchmark from DBLP. The experimental results show that ICREs are expressive enough to cover the vast majority of practical Relax NG schemas. Our algorithm learns effectively from both small and large datasets, and its results are concise and more precise than those of other popular methods.
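To make two of the notions named in the abstract concrete, the following is a minimal Python sketch. It is not the paper's learning algorithm: it only illustrates the interleaving (shuffle) of two words and the construction of a single occurrence automaton from sample sequences of child element names, using the standard 2-gram construction. The function names `interleavings` and `build_soa`, the markers `SRC` and `SNK`, and the sample element names are assumptions introduced for illustration.

```python
from functools import lru_cache

# Hypothetical marker names for the unique source and sink states of the automaton.
SRC, SNK = "<src>", "<snk>"


def interleavings(u, v):
    """Return the set of all interleavings (shuffles) of the words u and v.

    Every result keeps the symbols of u in their original relative order,
    and likewise for v, but the two words may be mixed arbitrarily.
    """
    @lru_cache(maxsize=None)
    def go(i, j):
        if i == len(u):
            return frozenset({v[j:]})
        if j == len(v):
            return frozenset({u[i:]})
        take_u = {u[i] + w for w in go(i + 1, j)}
        take_v = {v[j] + w for w in go(i, j + 1)}
        return frozenset(take_u | take_v)

    return set(go(0, 0))


def build_soa(samples):
    """Build the edge set of a single occurrence automaton from samples.

    Each sample is a sequence of element names. Each distinct name becomes
    one state, and an edge (a, b) is added whenever b directly follows a
    in some sample. SRC and SNK mark the start and end of a sample.
    """
    edges = set()
    for word in samples:
        prev = SRC
        for name in word:
            edges.add((prev, name))
            prev = name
        edges.add((prev, SNK))
    return edges


if __name__ == "__main__":
    print(sorted(interleavings("ab", "c")))
    # Two illustrative samples in which author and title occur in either order.
    samples = [["author", "title", "year"], ["title", "author", "year"]]
    print(sorted(build_soa(samples)))
```

In the second example, the automaton contains both the edge from author to title and the edge from title to author. Samples of this kind are the ones an interleaving operator is meant to describe compactly.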

H. Chen—Work supported by the National Natural Science Foundation of China under Grant No. 61472405.

Notes

1. Where \(m_i \in \mathbb{N}\), \(n_i \in (\mathbb{N}\setminus\{0\}) \cup \{\infty\}\), and \(i \in \{1, 2, \ldots\}\).
2. The content model of an XSD must be a deterministic expression, as formally defined in [9]. Nondeterministic forms such as \(((day^{?}, month) \mid (month, day^{?}))^{?}, year\) are therefore illegal.
3. http://dblp.uni-trier.de/xml/.
4. The learning results are not unique, but GenICRE returns only one of the solutions.
5. Here m is the length of the expression, not counting operators, \(\emptyset\), and \(\varepsilon\) [4].
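The nondeterminism mentioned in note 2 can be checked mechanically. The sketch below assumes the standard position-based (Glushkov) characterization of deterministic expressions: an expression is deterministic if no two distinct positions carrying the same symbol compete in the first set or in the follow set of any position. It supports only the operators that the example needs (concatenation, choice, and the optional operator). The constructor names `sym`, `cat`, `alt`, `opt` and the functions `analyse` and `is_deterministic` are hypothetical and are not taken from the paper.

```python
from itertools import count


def sym(name): return ("sym", name)
def cat(left, right): return ("cat", left, right)
def alt(left, right): return ("alt", left, right)
def opt(expr): return ("opt", expr)


def analyse(expr):
    """Number symbol occurrences and compute first and follow sets."""
    ids = count(1)
    labels, follow = {}, {}

    def walk(e):
        # Returns (nullable, first positions, last positions) for e.
        kind = e[0]
        if kind == "sym":
            p = next(ids)
            labels[p] = e[1]
            follow[p] = set()
            return False, {p}, {p}
        if kind == "opt":
            _, first, last = walk(e[1])
            return True, first, last
        n1, f1, l1 = walk(e[1])
        n2, f2, l2 = walk(e[2])
        if kind == "alt":
            return n1 or n2, f1 | f2, l1 | l2
        # Concatenation: whatever ends the left part may be followed
        # by whatever starts the right part.
        for p in l1:
            follow[p] |= f2
        first = (f1 | f2) if n1 else f1
        last = (l1 | l2) if n2 else l2
        return n1 and n2, first, last

    _, first, _ = walk(expr)
    return first, follow, labels


def is_deterministic(expr):
    """True if no first or follow set holds two positions with one symbol."""
    first, follow, labels = analyse(expr)
    for group in [first, *follow.values()]:
        names = [labels[p] for p in group]
        if len(names) != len(set(names)):
            return False
    return True


# The expression from note 2: ((day?, month) | (month, day?))?, year
note2 = cat(
    opt(alt(cat(opt(sym("day")), sym("month")),
            cat(sym("month"), opt(sym("day"))))),
    sym("year"),
)
# A deterministic variant for comparison: day?, month, year
ordered = cat(opt(sym("day")), cat(sym("month"), sym("year")))

if __name__ == "__main__":
    print(is_deterministic(note2), is_deterministic(ordered))
```

For the expression in note 2, both occurrences of month can match the first input symbol, so a parser cannot decide which branch to take without looking ahead. This is the kind of conflict that the check reports.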

References

1. Abiteboul, S., Buneman, P., Suciu, D.: Data on the Web: From Relations to Semistructured Data and XML. Morgan Kaufmann, Burlington (2000)
2. Barbosa, D., Mignet, L., Veltri, P.: Studying the XML web: gathering statistics from an XML sample. World Wide Web-Internet Web Inf. Syst. 9(2), 187–212 (2006)
3. Beek, M.H.T., Kleijn, J.: Infinite unfair shuffles and associativity. Theor. Comput. Sci. 380(3), 401–410 (2007)
4. Bex, G.J., Gelade, W., Neven, F., Vansummeren, S.: Learning deterministic regular expressions for the inference of schemas from XML data. ACM Trans. Web 4(4), 1–32 (2010)
5. Bex, G.J., Neven, F., Bussche, J.V.D.: DTDs versus XML schema: a practical study. In: International Workshop on the Web and Databases, pp. 79–84 (2004)
6. Bex, G.J., Neven, F., Schwentick, T., Vansummeren, S.: Inference of concise regular expressions and DTDs. ACM Trans. Database Syst. 35(2), 1–47 (2010)
7. Bex, G.J., Neven, F., Vansummeren, S.: Inferring XML schema definitions from XML data. In: International Conference on Very Large Data Bases, University of Vienna, Austria, September, pp. 998–1009 (2007)
8. Boneva, I., Ciucanu, R., Staworko, S.: Simple schemas for unordered XML. In: International Workshop on the Web and Databases (2015)
9. Brüggemann-Klein, A.: Unambiguity of extended regular expressions in SGML document grammars. In: Lengauer, T. (ed.) ESA 1993. LNCS, vol. 726, pp. 73–84. Springer, Heidelberg (1993). https://doi.org/10.1007/3-540-57273-2_45
10. Che, D., Aberer, K., Özsu, M.T.: Query optimization in XML structured-document databases. VLDB J. 15(3), 263–289 (2006)
11. Ciucanu, R., Staworko, S.: Learning schemas for unordered XML. Computer Science (2013)
12. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 2nd edn, pp. 1297–1305 (2001)
13. Demany, D.: InstanceToSchema: a RELAX NG schema generator from XML instances (2003). http://www.xmloperator.net/i2s/
14. Feige, U.: Approximating maximum clique by removing subgraphs. SIAM J. Discret. Math. 18(2), 219–225 (2006)
15. Fernau, H.: Algorithms for learning regular expressions. Inf. Comput. 207(4), 521–541 (2009)
16. Florescu, D.: Managing semi-structured data. ACM Queue 3(8), 18–24 (2005)
17. Freydenberger, D.D., Kötzing, T.: Fast learning of restricted regular expressions and DTDs. Theory Comput. Syst. 57(4), 1114–1158 (2015)
18. Garcia, P., Vidal, E.: Inference of k-testable languages in the strict sense and application to syntactic pattern recognition. IEEE Trans. Pattern Anal. Mach. Intell. 12(9), 920–925 (1990)
19. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman and Company, New York (1979)
20. Garofalakis, M., Gionis, A., Rastogi, R., Seshadri, S., Shim, K.: XTRACT: learning document type descriptors from XML document collections. Data Mining Knowl. Discov. 7(1), 23–56 (2003)
21. Garofalakis, M.N., Gionis, A., Rastogi, R., Seshadri, S., Shim, K.: XTRACT: a system for extracting document type descriptors from XML documents. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, 16–18 May 2000, pp. 165–176 (2000)
22. Gold, E.M.: Language identification in the limit. Inf. Control 10(5), 447–474 (1967)
23. Grijzenhout, S., Marx, M.: The quality of the XML web. Web Semant.: Sci. Serv. Agents World Wide Web 19, 59–68 (2013)
24. Hopcroft, J.E., Motwani, R., Ullman, J.D.: Introduction to Automata Theory, Languages, and Computation. Addison-Wesley Series in Computer Science, 2nd edn. Addison-Wesley-Longman, Boston (2001). ISBN: 978-0-201-44124-6
25. Clark, J., Murata, M.: Relax NG specification. Organization for the Advancement of Structured Information Standards (OASIS) (2001)
26. Kim, G.-H., Ko, S.-K., Han, Y.-S.: Inferring a relax NG schema from XML documents. In: Dediu, A.-H., Janoušek, J., Martín-Vide, C., Truthe, B. (eds.) LATA 2016. LNCS, vol. 9618, pp. 400–411. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-30000-9_31
27. Koch, C., Scherzinger, S., Schweikardt, N., Stegmaier, B.: Schema-based scheduling of event processors and buffer minimization for queries on structured data streams. In: Thirtieth International Conference on Very Large Data Bases, pp. 228–239 (2004)
28. Manolescu, I., Florescu, D., Kossmann, D.: Answering XML queries on heterogeneous data sources. In: International Conference on Very Large Data Bases, pp. 241–250 (2001)
29. Martens, W., Neven, F.: Typechecking top-down uniform unranked tree transducers. In: International Conference on Database Theory, pp. 64–78 (2003)
30. Martens, W., Neven, F.: Frontiers of tractability for typechecking simple XML transformations. In: ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 23–34 (2004)
31. Peng, F., Chen, H.: Discovering restricted regular expressions with interleaving. In: Cheng, R., Cui, B., Zhang, Z., Cai, R., Xu, J. (eds.) APWeb 2015. LNCS, vol. 9313, pp. 104–115. Springer, Cham (2015). https://doi.org/10.1007/978-3-319-25255-1_9
32. Quinlan, J.R., Rivest, R.L.: Inferring decision trees using the minimum description length principle. Inf. Comput. 80(3), 227–248 (1989)

Author information

Authors and Affiliations

State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, 100190, China
Yeting Li, Xiaoying Mou & Haiming Chen
University of Chinese Academy of Sciences, Beijing, China
Yeting Li & Xiaoying Mou

Corresponding author

Correspondence to Haiming Chen.

Editor information

Editors and Affiliations

University of Connecticut, Storrs, CT, USA
Guojun Gan
Nanjing University of Aeronautics and Astronautics, Nanjing, China
Bohan Li
The University of Queensland, Brisbane, QLD, Australia
Xue Li
Beijing Institute of Technology, Beijing, China
Shuliang Wang

About this paper

Cite this paper

Li, Y., Mou, X., Chen, H. (2018). Learning Concise Relax NG Schemas Supporting Interleaving from XML Documents. In: Gan, G., Li, B., Li, X., Wang, S. (eds) Advanced Data Mining and Applications. ADMA 2018. Lecture Notes in Computer Science, vol 11323. Springer, Cham. https://doi.org/10.1007/978-3-030-05090-0_26

DOI: https://doi.org/10.1007/978-3-030-05090-0_26
Published: 29 December 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-05089-4
Online ISBN: 978-3-030-05090-0
eBook Packages: Computer Science, Computer Science (R0)