Data Complexity Issues in Grammatical Inference

de la Higuera, Colin

doi:10.1007/978-1-84628-172-3_8

Part of the book series: Advanced Information and Knowledge Processing (AI&KP)

Summary

Grammatical inference (also known as grammar induction) is a field that cuts across a number of research areas, including machine learning, formal language theory, syntactic and structural pattern recognition, computational linguistics, computational biology, and speech recognition. Among the specific features of the problems it studies are those related to data complexity. We argue that data complexity for grammatical inference can be studied at three levels. At the first (inner) level, the data can be strings, trees, or graphs; these are nontrivial objects on which topologies may not always be easy to manage. The second level concerns the classes, and the representations of the classes, used for classification; formal language theory provides an elegant setting based on rewriting systems and recursivity, but one that is not easy to work with for classification or learning tasks. The combinatorial problems usually attached to these tasks indeed prove difficult. The third level relates the objects to the classes. Membership may be problematic, and even more so when approximations (of the strings or the languages) are used, for instance in a noisy setting. We argue that the main difficulties arise from the fact that the structural definitions of the languages and the topological measures do not match.

Author information

Authors and Affiliations

EURISE, Faculté des Sciences et Techniques, 23 rue du Docteur Paul Michelon, 42023, Saint-Etienne Cedex 2, France
Colin de la Higuera

Editor information

Editors and Affiliations

Electrical Engineering Department, City College, City University of New York, USA
Mitra Basu PhD
Bell Laboratories, Lucent Technologies, New Jersey, USA
Tin Kam Ho BBA, MS, PhD

About this chapter

Cite this chapter

de la Higuera, C. (2006). Data Complexity Issues in Grammatical Inference. In: Basu, M., Ho, T.K. (eds) Data Complexity in Pattern Recognition. Advanced Information and Knowledge Processing. Springer, London. https://doi.org/10.1007/978-1-84628-172-3_8

DOI: https://doi.org/10.1007/978-1-84628-172-3_8
Publisher Name: Springer, London
Print ISBN: 978-1-84628-171-6
Online ISBN: 978-1-84628-172-3
eBook Packages: Computer Science, Computer Science (R0)
