An Empirical Study of Sections in Classifying Disease Outbreak Reports

Doan, Son; Conway, Mike; Collier, Nigel

doi:10.1007/978-1-4419-1274-9_4

Chapter

Part of the book series: Annals of Information Systems (AOIS, volume 7)

Abstract

Identifying articles that relate to infectious diseases is a necessary step for any automatic bio-surveillance system that monitors news articles from the Internet. Unlike scientific articles, which are available in a strongly structured form, news articles are usually loosely structured. In this chapter, we investigate the importance of each section and the effect of section weighting on the performance of text classification. The experimental results show that (1) classification models using the headline and leading sentence achieve a high F-score compared to models using other parts of the article; (2) using all sections with a bag-of-words representation (full text) achieves the highest recall; and (3) section weighting information can help to improve accuracy.
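
The idea of section weighting can be illustrated with a minimal sketch. It is not the chapter's implementation: the section names (`headline`, `lead`, `body`), the weight values, and the helper names `tokenize` and `weighted_bow` are all assumptions made for illustration. The sketch builds a bag-of-words vector in which each term's count is scaled by the weight of the section it appears in, so that terms from the headline and leading sentence contribute more than terms from the body.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def weighted_bow(sections, weights):
    """Build a section-weighted bag-of-words vector.

    `sections` maps a section name to its text; `weights` maps a section
    name to a multiplier (hypothetical values, not taken from the chapter).
    Sections without an explicit weight default to 1.0.
    """
    features = Counter()
    for name, text in sections.items():
        weight = weights.get(name, 1.0)
        for token in tokenize(text):
            features[token] += weight
    return features


# Hypothetical news article split into three sections.
article = {
    "headline": "Cholera outbreak reported",
    "lead": "Health officials confirmed a cholera outbreak on Monday.",
    "body": "Officials said the outbreak is under investigation.",
}
# Illustrative weights only: headline > lead > body.
weights = {"headline": 3.0, "lead": 2.0, "body": 1.0}

vec = weighted_bow(article, weights)
```

Setting every weight to 1.0 reduces this to the plain full-text bag-of-words representation, which makes it easy to compare weighted and unweighted features with the same classifier.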

Author information

Authors and Affiliations

Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, 37203, USA
Son Doan
National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, Japan
Mike Conway & Nigel Collier

Corresponding author

Correspondence to Son Doan.

Editor information

Editors and Affiliations

Department of Nursing, University of Peloponnese, 231 00, Sparta General Hospital Bldg., Sparta, Greece
Athina Lazakidou

About this chapter

Cite this chapter

Doan, S., Conway, M., Collier, N. (2010). An Empirical Study of Sections in Classifying Disease Outbreak Reports. In: Lazakidou, A. (eds) Web-Based Applications in Healthcare and Biomedicine. Annals of Information Systems, vol 7. Springer, Boston, MA. https://doi.org/10.1007/978-1-4419-1274-9_4

DOI: https://doi.org/10.1007/978-1-4419-1274-9_4
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4419-1273-2
Online ISBN: 978-1-4419-1274-9
eBook Packages: Business and Economics; Business and Management (R0)
