Learning Derived Words from Medical Corpora

  • Pierre Zweigenbaum
  • Natalia Grabar
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2780)


Morphological knowledge (inflection, derivation, compounding) is useful for medical language processing. Such knowledge is available for medical English in the UMLS Specialist Lexicon, but no equivalent exists for French. Large corpora of medical texts can now be obtained from the Web. We propose a method, based on the cooccurrence of formally similar words, that takes advantage of such a corpus to learn morphological knowledge for French medical words. The relations obtained before filtering have an average precision of 75.6% after 5,000 word pairs. Detailed examination of the results obtained on a sample of 376 French SNOMED anatomy nouns shows that 91–94% of the proposed derived adjectives are correct, that 36% of the nouns receive a correct adjective, and that the method can add 41% more derived adjectives than SNOMED already specifies. We discuss these results and propose directions for improvement.
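
The core idea of the abstract, counting formally similar words that occur near each other, can be illustrated with a minimal sketch. It is not the authors' implementation: the shared-prefix test, the function names, the window size of 5 tokens, and the minimum prefix length of 4 characters are all assumptions chosen for illustration, and the association score and filtering steps mentioned in the abstract are not modelled.

```python
from collections import Counter


def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def similar_cooccurring_pairs(tokens, window=5, min_prefix=4):
    """Count pairs of distinct words that share a prefix of at least
    `min_prefix` characters and occur within `window` tokens of each other.
    Both thresholds are hypothetical, not values taken from the paper."""
    counts = Counter()
    for i, w in enumerate(tokens):
        # Look only at the tokens that follow w inside the window.
        for v in tokens[i + 1 : i + 1 + window]:
            if v != w and common_prefix_len(w, v) >= min_prefix:
                counts[tuple(sorted((w, v)))] += 1
    return counts


# Toy French token sequence (invented for this example).
tokens = (
    "la paroi de l' aorte et la valve aortique ; "
    "le muscle et la fibre musculaire"
).split()
pairs = similar_cooccurring_pairs(tokens)
for pair, n in pairs.items():
    print(pair, n)
```

On this toy input the candidate pairs are aorte/aortique and muscle/musculaire, the kind of noun/adjective relation the abstract evaluates. On a real corpus the raw counts would still have to be ranked and filtered.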


Word Pair · Medical Corpus · Association Score · Local Precision · Correct Pair



Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Pierre Zweigenbaum (1)
  • Natalia Grabar (1)
  1. Mission de recherche en Sciences et Technologies de l’Information Médicale, STIM/DPA/DSI, Assistance Publique – Hôpitaux de Paris & ERM 202 INSERM