Efficient regular expression matching on LZ77 compressed strings using negative factors

World Wide Web

Abstract

State-of-the-art approaches to regular expression matching on LZ78 compressed strings are inefficient. Moreover, LZ78 compression has shortcomings of its own, such as a higher (worse) compression ratio and slower decompression than LZ77 (an earlier algorithm in the same Lempel-Ziv family). In this paper, we study regular expression matching on LZ77 compressed strings. We propose an efficient algorithm, RELZ, that uses the positive factors (the prefix and the suffix) and the negative factors (substrings that cannot appear in any answer) of the regular expression to prune candidates. To locate these two kinds of factors quickly on the compressed string without decompression, we design a variant of the suffix trie index, called SSLZ, and construct bitmaps for the factors of the regular expression to detect candidates. Because SSLZ has a high space cost, we also propose a variant index that partially maintains the suffixes of high-frequency phrases, and we develop an efficient regular expression matching algorithm, RELZ+, based on this index. In addition, we propose two optimization strategies, block filtering and LZ filtering, to prune false positive candidates. Finally, we conduct a comprehensive performance evaluation on four real data sets to validate our ideas and the proposed algorithms. The experimental results show that RELZ and RELZ+ significantly outperform the existing algorithms.
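The pruning idea in the abstract can be illustrated with a minimal sketch. It is not the RELZ algorithm: it runs on an uncompressed string, uses neither the SSLZ index nor bitmaps, and takes the prefix, suffix, and negative factors as hand-picked inputs rather than deriving them from the regular expression. The function names (`occurrences`, `candidate_spans`, `match_with_factors`) and the sample pattern are hypothetical, and the sketch assumes the prefix and suffix do not overlap inside an answer.

```python
import re


def occurrences(text, factor):
    """Return every start offset of factor in text (overlaps included)."""
    return [i for i in range(len(text) - len(factor) + 1)
            if text.startswith(factor, i)]


def candidate_spans(text, prefix, suffix, negative_factors):
    """Pair prefix and suffix occurrences, dropping spans with a negative factor."""
    starts = occurrences(text, prefix)
    ends = [i + len(suffix) for i in occurrences(text, suffix)]
    spans = []
    for s in starts:
        for e in ends:
            # Assumption: prefix and suffix do not overlap inside an answer.
            if e - s < len(prefix) + len(suffix):
                continue
            window = text[s:e]
            # A negative factor cannot appear in any answer, so prune the span.
            if any(nf in window for nf in negative_factors):
                continue
            spans.append((s, e))
    return spans


def match_with_factors(text, pattern, prefix, suffix, negative_factors):
    """Verify only the surviving candidates against the full pattern."""
    compiled = re.compile(pattern)
    return [(s, e)
            for s, e in candidate_spans(text, prefix, suffix, negative_factors)
            if compiled.fullmatch(text[s:e])]


text = "abcdefxabxefabef"
# Every answer of ab[cd]*ef starts with "ab", ends with "ef",
# and can never contain "x", so "x" serves as a negative factor here.
print(match_with_factors(text, r"ab[cd]*ef", "ab", "ef", ["x"]))
# [(0, 6), (12, 16)]
```

In this toy input, the negative factor reduces the number of spans handed to the verifier from six to two; the pruned spans all contain `x` and so could never have matched.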

Notes

  1. In general, a factor has several occurrences.

Acknowledgements

This work is partially supported by the National Natural Science Foundation of China (Nos. 61572122, U1736104, 61532021), the Liaoning BaiQianWan Talents Program, and the Fundamental Research Funds for the Central Universities (No. N171602003).

Author information

Corresponding author

Correspondence to Bin Wang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Special Issue on Web and Big Data

Guest Editors: Junjie Yao, Bin Cui, Christian S. Jensen, and Zhe Zhao

About this article

Cite this article

Han, Y., Wang, B., Yang, X. et al. Efficient regular expression matching on LZ77 compressed strings using negative factors. World Wide Web 22, 2519–2543 (2019). https://doi.org/10.1007/s11280-019-00667-z

  • DOI: https://doi.org/10.1007/s11280-019-00667-z
