Improvement of the State Merging Rule on Noisy Data in Probabilistic Grammatical Inference

  • Amaury Habrard
  • Marc Bernard
  • Marc Sebban
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2837)


In this paper we study the influence of noise in probabilistic grammatical inference. Somewhat paradoxically, we show that specialized automata cope better with noisy data than more general ones. We therefore propose replacing the statistical test of the Alergia algorithm with a more restrictive merging rule based on a test of proportion comparison. We show experimentally that this approach produces larger automata that handle noisy data better, according to two different performance criteria (perplexity and distance to the target model).
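To make the contrast between the two merging rules concrete, here is a minimal sketch. It is not the authors' implementation. The first function uses the Hoeffding-bound compatibility test commonly associated with Alergia. The second uses a standard pooled two-proportion z-test as a stand-in for the "test of proportion comparison"; the exact statistic used in the paper is not given in the abstract, so that choice is an assumption. The function names, the parameter `alpha`, and the example counts are all hypothetical.

```python
import math
from statistics import NormalDist


def hoeffding_compatible(f1, n1, f2, n2, alpha=0.05):
    """Alergia-style test: True if the observed frequencies f1/n1 and f2/n2
    differ by less than the Hoeffding bound (the states may be merged)."""
    bound = math.sqrt(0.5 * math.log(2.0 / alpha)) * (
        1.0 / math.sqrt(n1) + 1.0 / math.sqrt(n2)
    )
    return abs(f1 / n1 - f2 / n2) < bound


def proportion_compatible(f1, n1, f2, n2, alpha=0.05):
    """Assumed stand-in: pooled two-proportion z-test.
    True if equality of the two proportions is NOT rejected at level alpha."""
    pooled = (f1 + f2) / (n1 + n2)
    if pooled in (0.0, 1.0):
        # Both proportions are identical (all 0 or all 1): nothing to reject.
        return True
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = abs(f1 / n1 - f2 / n2) / std_err
    return z < NormalDist().inv_cdf(1.0 - alpha / 2.0)


# Hypothetical counts: a symbol follows state A 30 times out of 100
# and follows state B 50 times out of 100.
print(hoeffding_compatible(30, 100, 50, 100))   # True: the merge is accepted
print(proportion_compatible(30, 100, 50, 100))  # False: the merge is rejected
```

On these counts the Hoeffding bound accepts the merge while the z-test rejects it. A rule that rejects more merges leaves more states unmerged, which is consistent with the larger automata reported above.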


Keywords: probabilistic grammatical inference, noisy data, statistical approaches



Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Amaury Habrard (1)
  • Marc Bernard (1)
  • Marc Sebban (1)
  1. EURISE, Université Jean Monnet de Saint-Etienne, Saint-Etienne cedex 2, France
