Data Complexity Issues in Grammatical Inference

  • Chapter
Data Complexity in Pattern Recognition

Part of the book series: Advanced Information and Knowledge Processing (AI&KP)

Summary

Grammatical inference (also known as grammar induction) is a field that cuts across a number of research areas, including machine learning, formal language theory, syntactic and structural pattern recognition, computational linguistics, computational biology, and speech recognition. The problems it studies have specific features, some of which relate to data complexity. We argue that data complexity in grammatical inference can be studied at three levels. At the first (inner) level, the data can be strings, trees, or graphs; these are nontrivial objects on which topologies are not always easy to manage. The second level concerns the classes, and the representations of the classes, used for classification; formal language theory provides an elegant setting based on rewriting systems and recursivity, but one that is not easy to work with for classification or learning tasks. The combinatorial problems usually attached to these tasks prove to be difficult indeed. The third level relates the objects to the classes. Membership may be problematic, and even more so when approximations (of the strings or of the languages) are used, for instance in a noisy setting. We argue that the main difficulties arise because the structural definitions of the languages and the topological measures do not match.
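
The mismatch between structural definitions and topological measures can be illustrated with a small sketch. It is not taken from the chapter: it assumes the Levenshtein edit distance as the topological measure on strings, and the regular language of strings over {a, b} containing an even number of a's as the class. The function names (`edit_distance`, `in_even_a`, `neighbours`) are hypothetical and chosen only for this illustration.

```python
def edit_distance(u, v):
    """Levenshtein distance computed with the standard dynamic programme."""
    prev = list(range(len(v) + 1))
    for i, a in enumerate(u, start=1):
        curr = [i]
        for j, b in enumerate(v, start=1):
            curr.append(min(prev[j] + 1,               # delete a
                            curr[j - 1] + 1,           # insert b
                            prev[j - 1] + (a != b)))   # substitute or match
        prev = curr
    return prev[-1]


def in_even_a(w):
    """Membership in the (assumed) language: even number of 'a' symbols."""
    return w.count("a") % 2 == 0


def neighbours(w, alphabet="ab"):
    """All strings at edit distance exactly 1 from w."""
    out = set()
    for i in range(len(w) + 1):
        for c in alphabet:
            out.add(w[:i] + c + w[i:])                 # insertion
    for i in range(len(w)):
        out.add(w[:i] + w[i + 1:])                     # deletion
        for c in alphabet:
            if c != w[i]:
                out.add(w[:i] + c + w[i + 1:])         # substitution
    return out


if __name__ == "__main__":
    w = "abab"
    flipped = sorted(n for n in neighbours(w) if in_even_a(n) != in_even_a(w))
    print(f"{w!r} in language: {in_even_a(w)}")
    print(f"{len(flipped)} of {len(neighbours(w))} distance-1 neighbours change class")
```

Under these assumptions, any single edit that inserts, deletes, or substitutes an `a` moves the string across the class boundary, so strings that are as close as possible in edit distance can still lie on opposite sides of the language. A distance-based classifier therefore has little to hold on to for such a class, which is one way of reading the abstract's closing claim.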

Copyright information

© 2006 Springer Verlag London Limited

About this chapter

Cite this chapter

de la Higuera, C. (2006). Data Complexity Issues in Grammatical Inference. In: Basu, M., Ho, T.K. (eds) Data Complexity in Pattern Recognition. Advanced Information and Knowledge Processing. Springer, London. https://doi.org/10.1007/978-1-84628-172-3_8

  • DOI: https://doi.org/10.1007/978-1-84628-172-3_8

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-84628-171-6

  • Online ISBN: 978-1-84628-172-3

  • eBook Packages: Computer Science, Computer Science (R0)
