Cpriori: An Index-Based Framework to Extract the Generalized Center Strings

  • Conference paper
Web Information Systems Engineering – WISE 2018 (WISE 2018)

Abstract

The common approximate substring (CAS) problem is to extract the approximate substrings common to all sequences of a large sequence set. Requiring exact matches discards a large amount of useful information in sequential pattern mining. Rather than extracting exact substrings, it is more informative to extract the generalized center string (GCS), a string from which all of the matching substrings can be produced through a limited number of mutations. Solving the GCS problem supports accurate reasoning about mutation in real-world applications (e.g. biological sequence analysis). However, the task is very challenging because the complexity grows exponentially once the exact-match constraint is loosened. In this paper, we propose an index-based framework that solves the GCS problem using a divide-and-conquer strategy. In particular, we propose an efficient algorithm, named CValidating, that converts the problem of pattern extraction into a problem of query processing. Moreover, we devise a heuristic filtering strategy to reduce the search space. Experimental results show that our algorithm outperforms existing algorithms.
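
To make the notion of a center string concrete, here is a minimal brute-force sketch. It is not the CValidating algorithm or the index-based framework proposed in the paper. It assumes that "mutation" means character substitution, so that closeness is measured by Hamming distance. The function names and the parameters `k` (pattern length), `d` (mutation budget), and `alphabet` are hypothetical, chosen only for illustration.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def occurs_approximately(center: str, sequence: str, d: int) -> bool:
    """True if some window of `sequence` is within d substitutions of `center`."""
    k = len(center)
    return any(
        hamming(center, sequence[i:i + k]) <= d
        for i in range(len(sequence) - k + 1)
    )


def brute_force_centers(sequences, k, d, alphabet="ACGT"):
    """Enumerate all |alphabet|**k candidates (assumption: substitution-only
    mutations) and keep those that occur approximately in every sequence."""
    centers = []
    for chars in product(alphabet, repeat=k):
        candidate = "".join(chars)
        if all(occurs_approximately(candidate, s, d) for s in sequences):
            centers.append(candidate)
    return centers


if __name__ == "__main__":
    # The center need not appear exactly in any input sequence.
    print(brute_force_centers(["ACCT", "AGGT", "ACGA"], k=4, d=1))
```

The enumeration visits every string of length `k` over the alphabet, which illustrates the exponential growth in candidates that the abstract describes and that indexing and filtering are meant to tame.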

Supported by National Key Research and Development Program of China.

Notes

  1. http://epd.vital-it.ch/index.php

  2. https://www.ncbi.nlm.nih.gov

Acknowledgement

This work is supported by the National Key Research and Development Program of China under grant 2016YFB1000902 and by the National Natural Science Foundation of China (Nos. 61232015, 61472412, 61621003).

Author information

Corresponding author

Correspondence to Shuhan Zhang.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Zhang, S., Hou, S., Fei, C. (2018). Cpriori: An Index-Based Framework to Extract the Generalized Center Strings. In: Hacid, H., Cellary, W., Wang, H., Paik, HY., Zhou, R. (eds) Web Information Systems Engineering – WISE 2018. WISE 2018. Lecture Notes in Computer Science, vol 11233. Springer, Cham. https://doi.org/10.1007/978-3-030-02922-7_32

  • DOI: https://doi.org/10.1007/978-3-030-02922-7_32

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-02921-0

  • Online ISBN: 978-3-030-02922-7

  • eBook Packages: Computer Science, Computer Science (R0)
