Skip to main content

Mining Uncertain Sequence Data on Hadoop Platform

  • Conference paper
  • First Online:
Trends and Applications in Knowledge Discovery and Data Mining (PAKDD 2014)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 8643))

Included in the following conference series:

Abstract

Sequence pattern mining is the mining of special and representative features hidden in sequence data. Recently, it has been attracting a lot of attention, especially in the fields of bioinformatics and spatio-temporal trajectory mining. Observing that many sequence data are born with uncertainties and huge sequence data are increasingly generated and accumulated, this paper aims to discover the hidden features from a large amount of uncertain sequence data. Specifically, Probabilistic Suffix Tree (PST) is an implementation of Variable-length Markov Chain (VMM) that has been widely applied in sequence data mining. However, the conventional PST construction algorithm is not for the mining of uncertain data and cannot bear the computing of huge data. Thus, to mine a large amount of sequence data with uncertainties, this paper proposes the uPST\(_{MR}^+\) algorithm on the Hadoop platform to fully utilize the computing power and storage capacity of cloud computing. The proposed uPST\(_{MR}^+\) algorithm constructs a PST in a progressive, multi-layered, and iterative manner so as to avoid excessive learning patterns and balance the overhead of distributed computing. In addition, to prevent the drag on overall performance owing to multiple scanning of the entire sequence data, we trade space for time by using a NodeArray data structure to store the intermediate statistical results to reduce disk I/O. To verify the performance of uPST\(_{MR}^{+}\), we conduct several experiments. The experimental results show that uPST\(_{MR}^{+}\) outperforms the naive approach significantly and show good scalability and stability. Also, although using NodeArray costs a little extra memory, the execution time is significantly lowered.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 84.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 109.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

  1. Roux, C., Bernard, R.T.F.: Home range size, spatial distribution and habitat use of elephants in two enclosed game reserves in the eastern cape province, south Africa. Afr. J. Ecol. 47(2), 146–153 (2007)

    Article  Google Scholar 

  2. Jin, J.-F., Liu, B.-F., Yu, X., Lu, C.-H.: Wintering and migration of black-faced spoonbill in Xinghua Bay, Fujian Province. Chin. J. Zool. 44(1), 47–53 (2009)

    Google Scholar 

  3. Wang, Y., Lim, E.-P., Hwang, S.-Y.: Efficient mining of group patterns from user movement data. DKE 57(3), 240–282 (2006)

    Article  Google Scholar 

  4. Li, Y., Han, J., Yang, J.: Clustering moving objects. In: ACM SIGKDD, pp. 617–622 (2004)

    Google Scholar 

  5. Aggarwal, C.C., Yu, P.S.: A survey of uncertain data algorithms and applications. IEEE TKDE 21(5), 609–623 (2008)

    Google Scholar 

  6. Gezici, S.: A survey on wireless position estimation. Wireless Pers. Commun. 44(3), 263–282 (2007)

    Article  Google Scholar 

  7. Gu, Y., Lo, A., Niemegeers, I.: A survey of indoor positioning systems for wireless personal networks. IEEE Commun. Surv. Tutorials 11(1), 13–32 (2009)

    Article  Google Scholar 

  8. Wang, J., Luo, Y., Zhao, Y., Le, J.: A survey on privacy preserving data mining. In: 1st International Workshop on Database Technology and Applications (2009)

    Google Scholar 

  9. Chen, R., Fung, B.C.M., Mohammed, N., Desai, B.C., Wang, K.: Privacy-preserving trajectory data publishing by local suppression. Inf. Sci. Spec. Issue Data Min. Inf. Secur. 231, 83–91 (2013)

    MATH  Google Scholar 

  10. Kuijpers, B., Othman, W.: Trajectory databases: data models, uncertainty and complete query languages. J. Comput. Syst. Sci. 76(7), 538–560 (2009)

    Article  MathSciNet  Google Scholar 

  11. Yang, J., Hu, M.: TrajPattern: mining sequential patterns from imprecise trajectories of mobile objects. In: Ioannidis, Y., Scholl, M.H., Schmidt, J.W., Matthes, F., Hatzopoulos, M., Böhm, K., Kemper, A., Grust, T., Böhm, C. (eds.) EDBT 2006. LNCS, vol. 3896, pp. 664–681. Springer, Heidelberg (2006)

    Chapter  Google Scholar 

  12. Leung, C.K.-S., Brajczuk, D.A.: Efficient algorithms for mining constrained frequent patterns from uncertain data. In: ACM SIGKDD Workshop on knowledge, Discovery from Uncertain Data (2009)

    Google Scholar 

  13. Ron, D., Singer, Y., Tishby, N.: Learning probabilistic automata with variable memory length. In: 7th Annual Conference on Computational Learning Theory, pp. 35–46 (1994)

    Google Scholar 

  14. Bejerano, G., Yona, G.: Variations on probabilistic suffix trees: statistical modeling and prediction of protein families. Bioinformatics 17(1), 23–43 (2001)

    Article  Google Scholar 

  15. Sun, P., Chawla, S., Arunasalam, B.: Mining for outliers in sequential databases. In: SDM (2006)

    Google Scholar 

  16. Pampapathi, R.M.: Annotated suffix trees for text modelling and classification. Dissertation, University of London (2008)

    Google Scholar 

  17. Dubnov, S., Assayag, G., Lartillot, O., Bejerano, G.: Using machine-learning methods for musical style modeling. J. Comput. 36(10), 73–80 (2003)

    Article  Google Scholar 

  18. Ghoting, A., Makarychev, K.: I/O efficient algorithms for serial and parallel suffix tree construction. ACM TODS 35(4), 1–37 (2010)

    Article  Google Scholar 

  19. Gao, F., Zaki, M.J.: Psist: a scalable approach to indexing protein structures using suffix trees. J. Parallel Distrib. Comput. 68(1), 54–63 (2008)

    Article  MATH  Google Scholar 

  20. Pellicer, S., Chen, G., Chan, K.C.C., Pan, Y.: Distributed sequence alignment applications for the public computing architecture. IEEE T-NB 7(1), 35–43 (2008)

    Google Scholar 

  21. Tsai, H.-P., Yang, D.-N., Chen, M.-S.: Mining group movement patterns for tracking moving objects efficiently. IEEE TKDE 23(2), 266–281 (2011)

    Google Scholar 

Download references

Acknowledgment

The work was supported in part by the National Science Council of Taiwan, R.O.C., under Contracts NSC102-2220-E-005-007.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Hsiao-Ping Tsai .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Sun, ZY., Tsai, MC., Tsai, HP. (2014). Mining Uncertain Sequence Data on Hadoop Platform. In: Peng, WC., et al. Trends and Applications in Knowledge Discovery and Data Mining. PAKDD 2014. Lecture Notes in Computer Science(), vol 8643. Springer, Cham. https://doi.org/10.1007/978-3-319-13186-3_20

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-13186-3_20

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-13185-6

  • Online ISBN: 978-3-319-13186-3

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics