Weakly-Supervised Symptom Recognition for Rare Diseases in Biomedical Text

Holat, Pierre; Tomeh, Nadi; Charnois, Thierry; Battistelli, Delphine; Jaulent, Marie-Christine; Métivier, Jean-Philippe

doi:10.1007/978-3-319-46349-0_17

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9897)

Included in the following conference series: International Symposium on Intelligent Data Analysis

Abstract

In this paper, we tackle the issue of symptom recognition for rare diseases in biomedical texts. Symptoms typically have more complex and ambiguous structure than other biomedical named entities. Furthermore, existing resources are scarce and incomplete. Therefore, we propose a weakly-supervised framework based on a combination of two approaches: sequential pattern mining under constraints and sequence labeling. We use unannotated biomedical paper abstracts with dictionaries of rare diseases and symptoms to create our training data. Our experiments show that both approaches outperform simple projection of the dictionaries on text, and their combination is beneficial. We also introduce a novel pattern mining constraint based on semantic similarity between words inside patterns.
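The dictionary projection that serves as the baseline, and that can also produce weak labels for training, can be illustrated with a minimal sketch. It assumes a greedy longest-match strategy over lowercased tokens and BIO output tags; the function name project_dictionary, the SYMPTOM tag, the max_len parameter, and the toy lexicon and sentence are all hypothetical and are not taken from the paper.

```python
from typing import List, Set, Tuple


def project_dictionary(
    tokens: List[str],
    lexicon: Set[Tuple[str, ...]],
    max_len: int = 5,
) -> List[str]:
    """Tag tokens with BIO labels by greedy longest-match lookup in a lexicon.

    Assumption: lexicon entries are tuples of lowercased tokens.
    """
    labels = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + n]) in lexicon:
                matched = n
                break
        if matched:
            labels[i] = "B-SYMPTOM"
            for j in range(i + 1, i + matched):
                labels[j] = "I-SYMPTOM"
            i += matched  # skip past the matched term
        else:
            i += 1
    return labels


# Toy, made-up lexicon and sentence, for illustration only.
lexicon = {("muscle", "weakness"), ("fever",)}
tokens = "Patients presented with muscle weakness and fever .".split()
labels = project_dictionary(tokens, lexicon)
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

Labels produced this way are noisy, since a dictionary misses unlisted symptom mentions and ignores context; this is one plausible reason why learned models trained on such data could generalize beyond plain projection.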

Notes

1. http://www.ncbi.nlm.nih.gov/pubmed
2. http://human-phenotype-ontology.github.io
3. http://www.orphadata.org
4. Using the script provided by http://www.cnts.ua.ac.be/conll2000/chunking/output.html, which takes the same input data format (BIO) as our data.
5. http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger
6. http://nlp.stanford.edu/software/CRF-NER.shtml
7. https://www.nlm.nih.gov/mesh/meshhome.html
8. https://www.nlm.nih.gov/research/umls/

Acknowledgments

This work is supported by the French National Research Agency (ANR) as part of the project Hybride ANR-11-BS02-002 and the “Investissements d’Avenir” program (reference: ANR-10-LABX-0083).

Author information

Authors and Affiliations

LIPN, University of Paris 13, Sorbonne Paris Cité, Paris, France
Pierre Holat, Nadi Tomeh & Thierry Charnois
MoDyCo, University of Paris Ouest Nanterre La Défense, Paris, France
Delphine Battistelli
Inserm, Paris, France
Marie-Christine Jaulent
GREYC, University of Caen Basse-Normandie, Caen, France
Jean-Philippe Métivier

Corresponding author

Correspondence to Pierre Holat.

Editor information

Editors and Affiliations

Stockholm University, Stockholm, Sweden
Henrik Boström
Leiden University, Leiden, The Netherlands
Arno Knobbe
University of Porto, Porto, Portugal
Carlos Soares
Stockholm University, Stockholm, Sweden
Panagiotis Papapetrou

About this paper

Cite this paper

Holat, P., Tomeh, N., Charnois, T., Battistelli, D., Jaulent, MC., Métivier, JP. (2016). Weakly-Supervised Symptom Recognition for Rare Diseases in Biomedical Text. In: Boström, H., Knobbe, A., Soares, C., Papapetrou, P. (eds) Advances in Intelligent Data Analysis XV. IDA 2016. Lecture Notes in Computer Science, vol 9897. Springer, Cham. https://doi.org/10.1007/978-3-319-46349-0_17

DOI: https://doi.org/10.1007/978-3-319-46349-0_17
Published: 21 September 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46348-3
Online ISBN: 978-3-319-46349-0
eBook Packages: Computer Science, Computer Science (R0)
