Cpriori: An Index-Based Framework to Extract the Generalized Center Strings

Zhang, Shuhan; Hou, Shengluan; Fei, Chaoqun

doi:10.1007/978-3-030-02922-7_32

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11233)

Included in the following conference series:

International Conference on Web Information Systems Engineering

Abstract

The common approximate substring (CAS) problem is to extract the approximate substrings that are common to all sequences in a large sequence set. Requiring exact matches discards a large amount of useful information in sequential pattern mining. Instead of extracting exact substrings, it is more meaningful to extract the generalized center string (GCS): the string from which all the other exact substrings can be produced through a limited number of mutations. Solving the GCS problem supports accurate reasoning about mutation in real-world applications (e.g., biological sequence analysis). However, the task is very challenging because the complexity grows exponentially once the constraints are loosened. In this paper, we propose an index-based framework that solves the GCS problem using a divide-and-conquer strategy. In particular, we propose an efficient algorithm, named CValidating, that converts the pattern-extraction problem into a query-processing problem. Moreover, a heuristic filtering strategy is devised to reduce the search space. Experimental results show that our algorithm outperforms the existing algorithms.
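To make the definition concrete, the following minimal sketch enumerates center strings by brute force. It is not the Cpriori framework or the CValidating algorithm, whose details are not given in the abstract. It assumes that a "mutation" is a single-character substitution (so distance is Hamming distance), that the center has a fixed length `k`, and that the alphabet is DNA. The function names and the parameters `k` and `d` are hypothetical. The enumeration over all `len(alphabet) ** k` candidates also shows why the search space grows exponentially once exact matching is relaxed.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs_approximately(center: str, sequence: str, d: int) -> bool:
    """True if some substring of `sequence` is within d substitutions of `center`."""
    k = len(center)
    return any(
        hamming(center, sequence[i:i + k]) <= d
        for i in range(len(sequence) - k + 1)
    )


def brute_force_centers(sequences, k, d, alphabet="ACGT"):
    """Return every length-k string that approximately occurs in all sequences.

    Assumption: substitutions only (Hamming distance), fixed length k.
    Enumerates len(alphabet) ** k candidates, so it is for illustration only.
    """
    candidates = ("".join(p) for p in product(alphabet, repeat=k))
    return [
        c for c in candidates
        if all(occurs_approximately(c, s, d) for s in sequences)
    ]


if __name__ == "__main__":
    seqs = ["ACGTAC", "TTCGTA", "GGCGAA"]
    print(brute_force_centers(seqs, k=3, d=0))  # exact matching
    print(brute_force_centers(seqs, k=3, d=1))  # one substitution allowed
```

With exact matching (`d=0`) the three example sequences share no length-3 substring, whereas allowing one substitution admits centers such as `CGT`, which matches `CGT`, `CGT`, and `CGA` respectively.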

Supported by the National Key Research and Development Program of China.

Acknowledgement

This work is supported by the National Key Research and Development Program of China under grant 2016YFB1000902 and by the National Natural Science Foundation of China (Nos. 61232015, 61472412, 61621003).

Author information

Authors and Affiliations

Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Shuhan Zhang, Shengluan Hou & Chaoqun Fei
University of Chinese Academy of Sciences, Beijing, China
Shuhan Zhang, Shengluan Hou & Chaoqun Fei

Authors

Shuhan Zhang
Shengluan Hou
Chaoqun Fei

Corresponding author

Correspondence to Shuhan Zhang.

Editor information

Editors and Affiliations

Zayed University, Dubai, United Arab Emirates
Hakim Hacid
Poznan University of Economics, Poznan, Poland
Wojciech Cellary
University of Victoria, Footscray, VIC, Australia
Hua Wang
UNSW Australia, Sydney, NSW, Australia
Hye-Young Paik
Swinburne University of Technology, Hawthorn, VIC, Australia
Rui Zhou

About this paper

Cite this paper

Zhang, S., Hou, S., Fei, C. (2018). Cpriori: An Index-Based Framework to Extract the Generalized Center Strings. In: Hacid, H., Cellary, W., Wang, H., Paik, HY., Zhou, R. (eds) Web Information Systems Engineering – WISE 2018. WISE 2018. Lecture Notes in Computer Science, vol 11233. Springer, Cham. https://doi.org/10.1007/978-3-030-02922-7_32

DOI: https://doi.org/10.1007/978-3-030-02922-7_32
Published: 20 October 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-02921-0
Online ISBN: 978-3-030-02922-7
eBook Packages: Computer Science, Computer Science (R0)