Automatic Specialized vs. Non-specialized Text Differentiation: The Usability of Grammatical Features in a Latin Multilingual Context

  • M. Teresa Cabré
  • Iria da Cunha
  • Eric SanJuan
  • Juan-Manuel Torres-Moreno
  • Jorge Vivaldi
Part of the Educational Linguistics book series (EDUL, volume 19)


This chapter shows that certain grammatical features, in addition to the lexicon, have strong potential to differentiate specialized texts from non-specialized ones. A tool that uses these features has been developed and trained with machine learning techniques based on association rules, using two sub-corpora (specialized and non-specialized), each divided into a training corpus and a test corpus. The evaluation of the tool shows that this strategy is suitable for differentiating specialized texts from non-specialized texts. These results offer an innovative perspective for research in domains related to terminology, specialized discourse and computational linguistics, with applications to the automatic compilation of Languages for Specific Purposes (LSP) corpora and to Adaptive Focused Information Retrieval (AFIR), among others.
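The general idea of classifying texts with association rules over grammatical features can be illustrated with a minimal sketch. This is not the authors' tool: the abstract does not specify the features, the rule-mining algorithm or the thresholds, so the feature names, the toy training and test data, the `min_support` and `min_confidence` values, and the helper functions `mine_class_rules` and `classify` below are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def mine_class_rules(docs, min_support=0.3, min_confidence=0.8, max_len=2):
    """Mine rules of the form {features} -> label from labeled documents.

    docs is a list of (feature_set, label) pairs. A rule is kept when the
    feature set and the label co-occur in at least min_support of all
    documents and the label holds for at least min_confidence of the
    documents containing the feature set.
    """
    n = len(docs)
    itemset_counts = Counter()
    joint_counts = Counter()
    for features, label in docs:
        for k in range(1, max_len + 1):
            for itemset in combinations(sorted(features), k):
                itemset_counts[itemset] += 1
                joint_counts[(itemset, label)] += 1

    rules = []
    for (itemset, label), joint in joint_counts.items():
        support = joint / n
        confidence = joint / itemset_counts[itemset]
        if support >= min_support and confidence >= min_confidence:
            rules.append((itemset, label, support, confidence))
    # Strongest rules first: higher confidence, then higher support.
    rules.sort(key=lambda r: (-r[3], -r[2], r[0]))
    return rules


def classify(features, rules, default="unknown"):
    """Return the label of the first rule whose feature set is present."""
    for itemset, label, _, _ in rules:
        if set(itemset) <= features:
            return label
    return default


# Hypothetical training sub-corpora, each text reduced to coarse
# grammatical features (invented for this example).
train = [
    ({"high_noun_ratio", "few_first_person_verbs"}, "specialized"),
    ({"high_noun_ratio", "few_first_person_verbs", "many_adjectives"}, "specialized"),
    ({"high_noun_ratio"}, "specialized"),
    ({"many_first_person_verbs", "many_adverbs"}, "non-specialized"),
    ({"many_first_person_verbs"}, "non-specialized"),
    ({"many_first_person_verbs", "many_adjectives"}, "non-specialized"),
]

# Hypothetical held-out test corpus.
test = [
    ({"high_noun_ratio", "many_adverbs"}, "specialized"),
    ({"many_first_person_verbs", "many_adjectives"}, "non-specialized"),
]

rules = mine_class_rules(train)
correct = sum(classify(features, rules) == label for features, label in test)
accuracy = correct / len(test)
print(f"{len(rules)} rules, test accuracy = {accuracy:.2f}")
```

The sketch enumerates feature sets exhaustively, which is only workable for a handful of features. A real system would more plausibly use a dedicated frequent-itemset miner and features derived from grammatical tagging of the corpora.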


Grammatical Features Specialized Texts Association Rules Information Retrieval Focus Grammatical Tagging 
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



This work has been partially financed by the Spanish projects RICOTERM 4 (FFI2010-21365-C03-01) and APLE 2 (FFI2012-37260), and a Juan de la Cierva grant (JCI-2011-09665).



Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • M. Teresa Cabré (1)
  • Iria da Cunha (1), corresponding author
  • Eric SanJuan (2, 3)
  • Juan-Manuel Torres-Moreno (2, 3, 4)
  • Jorge Vivaldi (1)

  1. Institut Universitari de Lingüística Aplicada, Universidad Pompeu Fabra (UPF), Barcelona, Spain
  2. Departament Statistique et Informatique Décisionnelle, Laboratoire Informatique d’Avignon, Université d’Avignon (UAPV), Avignon, France
  3. Brain and Language Research Institute (BLRI), Avignon, France
  4. École Polytechnique de Montréal, Montreal, Canada
