An Empirical Study of Sections in Classifying Disease Outbreak Reports

Part of the book series: Annals of Information Systems ((AOIS,volume 7))

Abstract

Identifying articles that relate to infectious diseases is a necessary step for any automatic bio-surveillance system that monitors news articles from the Internet. Unlike scientific articles, which are available in a strongly structured form, news articles are usually loosely structured. In this chapter, we investigate the importance of each section and the effect of section weighting on the performance of text classification. The experimental results show that (1) classification models using the headline and leading sentence achieve a high F-score compared with models using other parts of the article; (2) a bag-of-words representation of all sections (full text) achieves the highest recall; and (3) section weighting information can help to improve accuracy.
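
As a rough illustration of what section weighting over a bag-of-words representation can look like, the sketch below scales the term counts from each section of a news article by a per-section weight before summing them into one feature vector. This is a minimal sketch under stated assumptions, not the chapter's actual method: the section names, the weight values in `SECTION_WEIGHTS`, the tokenizer, and the sample article are all hypothetical.

```python
from collections import Counter
import re

# Hypothetical section weights, chosen only for illustration;
# the abstract does not state the values used in the chapter.
SECTION_WEIGHTS = {"headline": 3.0, "lead": 2.0, "body": 1.0}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def weighted_bag_of_words(sections, weights=SECTION_WEIGHTS):
    """Sum term counts over all sections, scaling each section by its weight."""
    features = Counter()
    for name, text in sections.items():
        weight = weights.get(name, 1.0)  # unknown sections get weight 1.0
        for token, count in Counter(tokenize(text)).items():
            features[token] += weight * count
    return dict(features)


# Made-up example article split into three sections.
article = {
    "headline": "Cholera outbreak reported",
    "lead": "Health officials confirmed a cholera outbreak on Monday.",
    "body": "Officials said the outbreak began last week.",
}

features = weighted_bag_of_words(article)
print(features["cholera"], features["outbreak"], features["week"])
```

Setting every weight to 1.0 reduces this to a plain full-text bag of words, while giving nonzero weight only to the headline and lead restricts the features to those sections, so the same function can express the configurations the abstract compares.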

Author information

Corresponding author

Correspondence to Son Doan.

Copyright information

© 2010 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Doan, S., Conway, M., Collier, N. (2010). An Empirical Study of Sections in Classifying Disease Outbreak Reports. In: Lazakidou, A. (ed.) Web-Based Applications in Healthcare and Biomedicine. Annals of Information Systems, vol 7. Springer, Boston, MA. https://doi.org/10.1007/978-1-4419-1274-9_4
