Learning Good Edit Similarities with Generalization Guarantees

  • Aurélien Bellet
  • Amaury Habrard
  • Marc Sebban
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6911)


Similarity and distance functions are essential to many learning algorithms, so learning them has attracted a lot of interest. For structured data (e.g., strings or trees), edit similarities are widely used, and there exist a few methods for learning them. However, these methods offer no theoretical guarantee as to the generalization performance and discriminative power of the resulting similarities. Recently, a theory of learning with (ε,γ,τ)-good similarity functions was proposed. This new theory bridges the gap between the properties of a similarity function and its performance in classification. In this paper, we propose a novel edit similarity learning approach (GESL) driven by the idea of (ε,γ,τ)-goodness, which allows us to derive generalization guarantees using the notion of uniform stability. We experimentally show that edit similarities learned with our method induce classification models that are both more accurate and sparser than those induced by the edit distance or edit similarities learned with a state-of-the-art method.
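
As a rough illustration of what an edit similarity with learnable costs looks like, the sketch below computes a weighted edit distance between two strings by dynamic programming and maps it to a similarity in (-1, 1]. It is not the GESL algorithm from the paper: the function names, the `sub_cost` dictionary, the default unit costs, and the mapping `2 * exp(-d) - 1` are all assumptions chosen for illustration.

```python
import math


def weighted_edit_distance(x, y, sub_cost, indel_cost=1.0):
    """Minimum total cost of turning x into y.

    sub_cost is a hypothetical dict mapping (a, b) symbol pairs to a
    substitution cost; pairs missing from it cost 1.0 (an assumption).
    Insertions and deletions all cost indel_cost.
    """
    n, m = len(x), len(y)
    # prev[j] holds the cost of turning the first i-1 symbols of x
    # into the first j symbols of y.
    prev = [j * indel_cost for j in range(m + 1)]
    for i in range(1, n + 1):
        curr = [i * indel_cost] + [0.0] * m
        for j in range(1, m + 1):
            a, b = x[i - 1], y[j - 1]
            step = 0.0 if a == b else sub_cost.get((a, b), 1.0)
            curr[j] = min(
                prev[j - 1] + step,        # match or substitution
                prev[j] + indel_cost,      # deletion from x
                curr[j - 1] + indel_cost,  # insertion into x
            )
        prev = curr
    return prev[m]


def edit_similarity(x, y, sub_cost, indel_cost=1.0):
    """Map the distance to (-1, 1]; identical strings get similarity 1."""
    d = weighted_edit_distance(x, y, sub_cost, indel_cost)
    return 2.0 * math.exp(-d) - 1.0
```

With an empty `sub_cost` this reduces to the standard Levenshtein distance. Learning an edit similarity then amounts to choosing the entries of `sub_cost` (and the insertion and deletion costs) from labeled data rather than fixing them by hand.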


Keywords: Edit Similarity Learning, Good Similarity Functions





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Aurélien Bellet (1)
  • Amaury Habrard (2)
  • Marc Sebban (1)
  1. Laboratoire Hubert Curien, UMR CNRS 5516, University of Jean Monnet, Saint-Etienne Cedex 2, France
  2. Laboratoire d’Informatique Fondamentale, UMR CNRS 6166, University of Aix-Marseille, Marseille Cedex 13, France
