Monotony and Surprise

  • Alberto Apostolico
Chapter
Part of the Natural Computing Series book series (NCS)

Abstract

This paper reviews models and tools that have emerged in recent years from the author’s work on the discovery of interesting or anomalous patterns in sequences. Whereas customary approaches to pattern discovery proceed from either a statistical or a syntactic characterization alone, the approaches described here share the unifying feature of combining these two descriptors in a tightly intertwined, composite paradigm, whereby both syntactic structure and occurrence lists concur to define and identify a pattern in a subject. In turn, this supports a natural notion of pattern saturation, which enables one to partition patterns into equivalence classes over intervals of monotonicity of commonly adopted scores, in such a way that the subset of class representatives, consisting solely of saturated patterns, suffices to account for all patterns in the subject. The benefits consist not only of increased descriptive power, but especially of a mitigation of the often unmanageable roster of candidates unearthed in a discovery attempt, and of the daunting computational burden that goes with it.

The applications of this paradigm as highlighted here are believed to point to a largely unexpressed potential. The specific pattern structures and configurations described include solid character strings, strings with errors, consensus sequences consisting of intermixed solid and wild characters, co- and multiple occurrences, and association rules thereof, etc. It is also outlined how, from a dual perspective, these constructs support novel paradigms of data compression, which leads to succinct descriptors, kernels, classification, and clustering methods of possible broader interest. Although largely inspired by biological sequence analysis, the ideas presented here apply to sequences of general origin, and mostly generalize to higher aggregates such as arrays, trees, and special types of graphs.
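
As an illustration of the notion of saturation for solid character strings, the following minimal sketch treats a substring as saturated when every one-character extension, to the left or to the right, occurs strictly fewer times. This is only one simple reading of the idea, assumed here for illustration; the chapter's own definitions and algorithms may differ. The function names `occurrence_lists` and `saturated_patterns` are hypothetical, and the brute-force enumeration is not meant to reflect any efficient construction.

```python
from collections import defaultdict


def occurrence_lists(text):
    """Map every distinct substring of text to the sorted list of its start positions."""
    occ = defaultdict(list)
    n = len(text)
    for i in range(n):
        for j in range(i + 1, n + 1):
            occ[text[i:j]].append(i)
    return occ


def saturated_patterns(text):
    """Return substrings whose occurrence count drops under every one-character extension.

    Assumption for this sketch: a pattern is 'saturated' when it cannot be
    extended by one character on either side without losing occurrences.
    """
    occ = occurrence_lists(text)
    alphabet = set(text)
    result = {}
    for w, starts in occ.items():
        count = len(starts)
        # w is redundant if some extension occurs exactly as often as w itself.
        extendable = any(
            len(occ.get(c + w, ())) == count or len(occ.get(w + c, ())) == count
            for c in alphabet
        )
        if not extendable:
            result[w] = starts
    return result


if __name__ == "__main__":
    for pattern, starts in saturated_patterns("abcabcab").items():
        print(pattern, starts)
```

On the string "abcabcab", only three substrings survive under this reading: "ab", "abcab", and the whole string. Every other substring has the same number of occurrences as one of its extensions, so it is accounted for by a saturated representative.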

Keywords

Association Rule, Pattern Discovery, Longest Common Subsequence, Comput Biol
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  1. Dipartimento di Ingegneria dell’Informazione, Università di Padova, Padova, Italy
  2. College of Computing, Georgia Institute of Technology, Atlanta, USA
