Weakly-Supervised Symptom Recognition for Rare Diseases in Biomedical Text

  • Conference paper
Advances in Intelligent Data Analysis XV (IDA 2016)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9897)

Abstract

In this paper, we tackle the issue of symptom recognition for rare diseases in biomedical texts. Symptoms typically have more complex and ambiguous structure than other biomedical named entities. Furthermore, existing resources are scarce and incomplete. Therefore, we propose a weakly-supervised framework based on a combination of two approaches: sequential pattern mining under constraints and sequence labeling. We use unannotated biomedical paper abstracts with dictionaries of rare diseases and symptoms to create our training data. Our experiments show that both approaches outperform simple projection of the dictionaries on text, and their combination is beneficial. We also introduce a novel pattern mining constraint based on semantic similarity between words inside patterns.
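The "simple projection of the dictionaries on text" baseline mentioned above can be pictured with a minimal sketch. It is not the authors' implementation: it assumes whitespace-tokenized input, case-insensitive greedy longest-match lookup, and BIO output tags. The function name `project_dictionary`, the `SYMPTOM` label, and the toy dictionary are hypothetical and chosen only for illustration.

```python
from typing import List, Set, Tuple


def project_dictionary(
    tokens: List[str],
    dictionary: Set[Tuple[str, ...]],
    label: str = "SYMPTOM",  # hypothetical label name
) -> List[str]:
    """Tag tokens in BIO format by greedy longest-match dictionary lookup."""
    lowered = [t.lower() for t in tokens]
    max_len = max((len(term) for term in dictionary), default=0)
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in dictionary:
                matched = n
                break
        if matched:
            tags[i] = f"B-{label}"
            for j in range(i + 1, i + matched):
                tags[j] = f"I-{label}"
            i += matched  # skip past the matched term
        else:
            i += 1
    return tags


# Toy dictionary of lowercase, pre-tokenized terms (an assumption for this example).
symptoms = {("muscle", "weakness"), ("fever",)}
tokens = "Patients presented with muscle weakness and fever".split()
tags = project_dictionary(tokens, symptoms)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Tags produced this way are noisy and incomplete, since they cover only terms already present in the dictionary. That is consistent with the abstract's framing of projection as a baseline that both pattern mining and sequence labeling improve on.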

Notes

  1. http://www.ncbi.nlm.nih.gov/pubmed.

  2. http://human-phenotype-ontology.github.io.

  3. http://www.orphadata.org.

  4. Using the script provided by http://www.cnts.ua.ac.be/conll2000/chunking/output.html, which takes the same input data format (BIO) as our data.

  5. http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger.

  6. http://nlp.stanford.edu/software/CRF-NER.shtml.

  7. https://www.nlm.nih.gov/mesh/meshhome.html.

  8. https://www.nlm.nih.gov/research/umls/.

Acknowledgments

This work is supported by the French National Research Agency (ANR) as part of the project Hybride ANR-11-BS02-002 and the “Investissements d’Avenir” program (reference: ANR-10-LABX-0083).

Author information

Corresponding author

Correspondence to Pierre Holat.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Holat, P., Tomeh, N., Charnois, T., Battistelli, D., Jaulent, MC., Métivier, JP. (2016). Weakly-Supervised Symptom Recognition for Rare Diseases in Biomedical Text. In: Boström, H., Knobbe, A., Soares, C., Papapetrou, P. (eds) Advances in Intelligent Data Analysis XV. IDA 2016. Lecture Notes in Computer Science, vol 9897. Springer, Cham. https://doi.org/10.1007/978-3-319-46349-0_17

  • DOI: https://doi.org/10.1007/978-3-319-46349-0_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-46348-3

  • Online ISBN: 978-3-319-46349-0

  • eBook Packages: Computer Science, Computer Science (R0)
