Most Significant Substring Mining Based on Chi-square Measure

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2010)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6118)

Abstract

Given the vast reservoirs of sequence data stored worldwide, efficient mining of string databases such as intrusion detection systems, player statistics, texts, proteins, etc. has emerged as a great challenge. Searching for an unusual pattern within long strings of data is a requirement for diverse applications. Given a string, the problem is to identify the substrings that differ the most from the expected or normal behavior, i.e., the substrings that are statistically significant (less likely to occur due to chance alone). To this end, we use the chi-square measure and propose two heuristics for retrieving the top-k substrings with the largest chi-square measure. We show that the algorithms outperform other competing algorithms in runtime, while maintaining a high approximation ratio of more than 0.96.
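To make the problem statement concrete, the sketch below scores every substring with the standard Pearson chi-square statistic and returns the top-k by exhaustive search. It is not one of the two heuristics proposed in the paper, which the abstract does not describe; it is only a naive quadratic baseline of the kind such heuristics would approximate. The null model (characters drawn independently with fixed probabilities `probs`) and the function name `top_k_substrings` are assumptions made for illustration.

```python
from heapq import nlargest


def top_k_substrings(s, probs, k):
    """Return the k substrings of s with the largest Pearson chi-square value.

    Assumed null model (not taken from the paper): each character is drawn
    independently with probability probs[c]. For a substring of length l with
    observed counts Y_c, the statistic is sum_c (Y_c - l*p_c)^2 / (l*p_c).
    Results are (chi_square, start, end) tuples with end exclusive.
    """
    alphabet = list(probs)
    if any(ch not in probs for ch in s):
        raise ValueError("every character of s needs a probability in probs")

    n = len(s)
    # prefix[c][i] = number of occurrences of c in s[:i], so that counts for
    # any substring s[i:j] are available in O(1) per character.
    prefix = {c: [0] * (n + 1) for c in alphabet}
    for i, ch in enumerate(s):
        for c in alphabet:
            prefix[c][i + 1] = prefix[c][i] + (1 if ch == c else 0)

    scored = []
    for i in range(n):
        for j in range(i + 1, n + 1):
            length = j - i
            x2 = sum(
                (prefix[c][j] - prefix[c][i] - length * probs[c]) ** 2
                / (length * probs[c])
                for c in alphabet
            )
            scored.append((x2, i, j))
    # Exhaustive search: O(n^2 * |alphabet|) time, O(n^2) memory.
    return nlargest(k, scored)


if __name__ == "__main__":
    # With a fair two-letter alphabet, the run "aaaa" deviates most from
    # the expected half-and-half split.
    for x2, i, j in top_k_substrings("aaaab", {"a": 0.5, "b": 0.5}, 3):
        print(f"{'aaaab'[i:j]!r} chi-square={x2:.2f}")
```

The quadratic number of substrings is what makes this baseline impractical on long strings, which is presumably the motivation for trading exactness for runtime with an approximation ratio as reported in the abstract.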



Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Dutta, S., Bhattacharya, A. (2010). Most Significant Substring Mining Based on Chi-square Measure. In: Zaki, M.J., Yu, J.X., Ravindran, B., Pudi, V. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2010. Lecture Notes in Computer Science, vol 6118. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13657-3_35

  • DOI: https://doi.org/10.1007/978-3-642-13657-3_35

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-13656-6

  • Online ISBN: 978-3-642-13657-3

  • eBook Packages: Computer Science, Computer Science (R0)
