Similarity Word-Sequence Kernels for Sentence Clustering

  • Jesús Andrés-Ferrer
  • Germán Sanchis-Trilles
  • Francisco Casacuberta
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6218)


In this paper, we present a novel clustering approach that combines kernels used as similarity functions with the C-means algorithm. Several word-sequence kernels are defined and extended so that they satisfy the properties of similarity functions. These monolingual word-sequence kernels are then extended to bilingual word-sequence kernels and applied to monolingual and bilingual sentence clustering. The motivation is to group similar sentences into clusters so that a specialised model can be trained for each cluster, thereby reducing both the size and the complexity of the initial task. We provide empirical evidence that bilingual kernels can lead to better clusters in terms of intra-cluster perplexity.
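
The idea of clustering sentences with a kernel-based similarity and C-means can be illustrated with a minimal sketch. It is not the authors' implementation: the kernel below counts only contiguous word n-grams (a simplification of gap-weighted word-sequence kernels), the bilingual similarity is assumed to be a plain average of the source-side and target-side similarities, and all function names and the toy sentence pairs are hypothetical.

```python
from collections import Counter
from math import sqrt


def ngram_counts(sentence, max_n=2):
    """Count contiguous word n-grams of length 1..max_n."""
    words = sentence.split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


def kernel(s, t, max_n=2):
    """Inner product of n-gram count vectors (simplified word-sequence kernel)."""
    cs, ct = ngram_counts(s, max_n), ngram_counts(t, max_n)
    return sum(v * ct[g] for g, v in cs.items())


def similarity(s, t, max_n=2):
    """Cosine-normalised kernel: 1.0 for identical sentences, 0.0 if no n-gram is shared."""
    denom = sqrt(kernel(s, s, max_n) * kernel(t, t, max_n))
    return kernel(s, t, max_n) / denom if denom else 0.0


def bilingual_similarity(pair_a, pair_b, max_n=2):
    """Assumed combination: average of source-side and target-side similarity."""
    return 0.5 * (similarity(pair_a[0], pair_b[0], max_n)
                  + similarity(pair_a[1], pair_b[1], max_n))


def kernel_c_means(items, sim, assignment, max_iter=20):
    """C-means in feature space, driven only by pairwise similarities."""
    n = len(items)
    gram = [[sim(items[i], items[j]) for j in range(n)] for i in range(n)]
    assignment = list(assignment)
    for _ in range(max_iter):
        clusters = {}
        for idx, c in enumerate(assignment):
            clusters.setdefault(c, []).append(idx)
        # Squared norm of each centroid: mean of all within-cluster kernel values.
        within = {c: sum(gram[a][b] for a in m for b in m) / len(m) ** 2
                  for c, m in clusters.items()}
        new_assignment = []
        for i in range(n):
            # ||phi(x_i) - mu_c||^2 = k(i,i) - 2 * mean_j k(i,j) + ||mu_c||^2
            dist = {c: gram[i][i] - 2 * sum(gram[i][j] for j in m) / len(m) + within[c]
                    for c, m in clusters.items()}
            new_assignment.append(min(dist, key=dist.get))
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return assignment


# Toy English-Spanish sentence pairs (illustrative data only).
pairs = [
    ("the black cat sleeps", "el gato negro duerme"),
    ("the black cat eats", "el gato negro come"),
    ("the black cat drinks", "el gato negro bebe"),
    ("stock prices rose today", "los precios subieron hoy"),
    ("stock prices fell today", "los precios bajaron hoy"),
    ("stock prices rose yesterday", "los precios subieron ayer"),
]
# Start from a deliberately imperfect partition: the third pair is misplaced.
labels = kernel_c_means(pairs, bilingual_similarity, assignment=[0, 0, 1, 1, 1, 1])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Because the distance to a centroid is computed from kernel values alone, the centroids never need to be represented explicitly. On very small data sets the result depends strongly on the initial partition, since each sentence's similarity to itself pulls it towards its current cluster.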


Keywords: Support Vector Machine; Machine Translation; Text Categorisation; Statistical Machine Translation; Text Cluster



Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Jesús Andrés-Ferrer
    • 1
  • Germán Sanchis-Trilles
    • 1
  • Francisco Casacuberta
    • 1
  1. Instituto Tecnológico de Informática, Departamento de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia
