Abstract
Identifying articles that relate to infectious diseases is a necessary step for any automatic bio-surveillance system that monitors news articles from the Internet. Unlike scientific articles, which are available in a strongly structured form, news articles are usually loosely structured. In this chapter, we investigate the importance of each section and the effect of section weighting on the performance of text classification. The experimental results show that (1) classification models using the headline and leading sentence achieve a high F-score compared with models using other parts of the article; (2) using all sections with a bag-of-words representation (full text) achieves the highest recall; and (3) section weighting information can help to improve accuracy.
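To make the idea of section weighting concrete, the following is a minimal sketch of one common way to fold section information into a bag-of-words representation: each token's count is scaled by a weight attached to the section it appears in. The section names (`headline`, `lead`, `body`), the weight values, and the function name `section_weighted_bow` are illustrative assumptions, not the scheme or settings used in the chapter.

```python
from collections import Counter


def section_weighted_bow(sections, weights, default_weight=1.0):
    """Build one bag-of-words vector, scaling counts by a per-section weight.

    `sections` maps a section name to its text; `weights` maps a section
    name to a multiplier. Both the names and the values are hypothetical.
    """
    features = Counter()
    for name, text in sections.items():
        w = weights.get(name, default_weight)
        # Naive whitespace tokenization, lowercased; a real system would
        # use a proper tokenizer and stop-word handling.
        for token in text.lower().split():
            features[token] += w
    return dict(features)


# Toy article split into sections (invented text, for illustration only).
article = {
    "headline": "Cholera outbreak reported",
    "lead": "Officials confirmed a cholera outbreak on Monday",
    "body": "The ministry said more details will follow",
}
# Assumed weights: headline counts three times, lead sentence twice.
weights = {"headline": 3.0, "lead": 2.0, "body": 1.0}

vec = section_weighted_bow(article, weights)
```

Under these assumed weights, a term that appears in both the headline and the lead sentence contributes more to the feature vector than a term that appears only in the body. The resulting dictionary can be passed to any standard classifier that accepts weighted term features.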
Copyright information
© 2010 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Doan, S., Conway, M., Collier, N. (2010). An Empirical Study of Sections in Classifying Disease Outbreak Reports. In: Lazakidou, A. (eds) Web-Based Applications in Healthcare and Biomedicine. Annals of Information Systems, vol 7. Springer, Boston, MA. https://doi.org/10.1007/978-1-4419-1274-9_4
DOI: https://doi.org/10.1007/978-1-4419-1274-9_4
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4419-1273-2
Online ISBN: 978-1-4419-1274-9
eBook Packages: Business and Economics, Business and Management (R0)