A Large-Scale Repository of Deterministic Regular Expression Patterns and Its Applications

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2019)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11441)

Abstract

Deterministic regular expressions (DREs) are used in many areas of data management. However, to the best of our knowledge, no large-scale repository of DREs exists in the literature. In this paper, based on a large corpus of data that we harvested from the Web, we build such a repository in two steps: we first collect a repository of DREs by analyzing the determinism of the real data, and then normalize these DREs to construct a compact repository, called the DRE pattern set. Finally, we use our DRE patterns as benchmark datasets for several algorithms that previously lacked experiments on real DRE data. Experimental results demonstrate the usefulness of the repository.
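
To make the notion of determinism concrete, the following is a minimal sketch of the standard position-based test (one-unambiguity in the sense of Brüggemann-Klein and Wood): an expression is deterministic if no two distinct positions carrying the same symbol compete in the first set or in the follow set of any position. It is not the procedure used by the authors, which this page does not describe. The tuple-based expression encoding and the name `is_deterministic` are assumptions made for illustration only.

```python
from itertools import count


def is_deterministic(expr):
    """Check determinism (one-unambiguity) via first/follow position sets.

    Hypothetical encoding: ("sym", a), ("cat", e1, e2), ("alt", e1, e2),
    ("star", e), ("opt", e).
    """
    symbols = {}  # position -> alphabet symbol
    follow = {}   # position -> set of positions that may come next
    ids = count()

    def walk(e):
        # Returns (nullable, first positions, last positions) of e.
        kind = e[0]
        if kind == "sym":
            p = next(ids)
            symbols[p] = e[1]
            follow[p] = set()
            return False, {p}, {p}
        if kind == "alt":
            n1, f1, l1 = walk(e[1])
            n2, f2, l2 = walk(e[2])
            return n1 or n2, f1 | f2, l1 | l2
        if kind == "cat":
            n1, f1, l1 = walk(e[1])
            n2, f2, l2 = walk(e[2])
            for p in l1:
                follow[p] |= f2
            first = (f1 | f2) if n1 else f1
            last = (l1 | l2) if n2 else l2
            return n1 and n2, first, last
        if kind in ("star", "opt"):
            _, f, l = walk(e[1])
            if kind == "star":
                # Iteration: the last positions loop back to the first ones.
                for p in l:
                    follow[p] |= f
            return True, f, l
        raise ValueError(f"unknown node kind: {kind!r}")

    _, first, _ = walk(expr)

    def symbols_distinct(positions):
        seen = [symbols[p] for p in positions]
        return len(seen) == len(set(seen))

    return symbols_distinct(first) and all(
        symbols_distinct(s) for s in follow.values()
    )


a, b = ("sym", "a"), ("sym", "b")
# (a|b)*a: after reading "a" it is unclear which "a" position was matched.
nondet = ("cat", ("star", ("alt", a, b)), a)
# b*a(b*a)*: the same language, written deterministically.
det = ("cat", ("cat", ("star", b), a), ("star", ("cat", ("star", b), a)))

print(is_deterministic(nondet))  # False
print(is_deterministic(det))     # True
```

The two examples denote the same language, which illustrates that determinism is a property of the expression rather than of the language it describes.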

Work supported by the National Natural Science Foundation of China under Grant Nos. 61872339, 61472405, 61762061 and the Natural Science Foundation of Jiangxi Province, China under Grant 20161ACB20004.

Notes

  1. The total is not equal to the sum of the DTD, XSD, RNG and RegExLib counts, because duplicate DREs exist across the different types of files.

Author information

Corresponding author

Correspondence to Haiming Chen.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Chen, H., Li, Y., Dong, C., Chu, X., Mou, X., Min, W. (2019). A Large-Scale Repository of Deterministic Regular Expression Patterns and Its Applications. In: Yang, Q., Zhou, ZH., Gong, Z., Zhang, ML., Huang, SJ. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2019. Lecture Notes in Computer Science, vol 11441. Springer, Cham. https://doi.org/10.1007/978-3-030-16142-2_20

  • DOI: https://doi.org/10.1007/978-3-030-16142-2_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-16141-5

  • Online ISBN: 978-3-030-16142-2

  • eBook Packages: Computer Science, Computer Science (R0)
