Exploring Classification Concept Drift on a Large News Text Corpus

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2012)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7181)

Abstract

Concept drift has regained research interest in recent years, as many applications use data sources that change over time. We study a classification task using logistic regression on a large news collection of 248K texts spanning seven years. We present extrinsic methods for detecting and quantifying concept drift, based on forming training sets with different windowing techniques. We characterize concept drift on a seven-year-long Le Monde news corpus and show that classifier performance is overestimated if drift is neglected. For future work, we plan to refine the extrinsic characterization methods and to investigate how learning parameters drift when few examples are available.
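The abstract does not say which windowing techniques the paper uses. As a rough illustration only, the sketch below shows two common ways of forming a training set for a given test period: a growing window (all earlier periods) and a sliding window (only the most recent periods). The function names, the `(period, text, label)` document layout, and the `width` parameter are assumptions made for this example and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

# A document is (period_index, text, label). This layout is an assumption
# made for the sketch, not the corpus format used in the paper.
Doc = Tuple[int, str, str]


def training_periods(test_period: int, scheme: str, width: int = 1) -> List[int]:
    """Return the period indices whose documents form the training set.

    "growing": every period before the test period.
    "sliding": only the `width` periods immediately before the test period.
    """
    if test_period < 1:
        raise ValueError("test_period must be at least 1 so that earlier data exists")
    if scheme == "growing":
        return list(range(test_period))
    if scheme == "sliding":
        if width < 1:
            raise ValueError("width must be at least 1")
        return list(range(max(0, test_period - width), test_period))
    raise ValueError(f"unknown scheme: {scheme!r}")


def split_by_window(
    docs: Sequence[Doc], test_period: int, scheme: str, width: int = 1
) -> Tuple[List[Doc], List[Doc]]:
    """Split documents into a past-only training set and a test set."""
    periods = set(training_periods(test_period, scheme, width))
    train = [d for d in docs if d[0] in periods]
    test = [d for d in docs if d[0] == test_period]
    return train, test


if __name__ == "__main__":
    # Hypothetical toy data: one document per period.
    toy = [(0, "a", "sport"), (1, "b", "politics"), (2, "c", "sport"), (3, "d", "politics")]
    train, test = split_by_window(toy, test_period=3, scheme="sliding", width=2)
    print([d[0] for d in train], [d[0] for d in test])
```

Because the training documents always precede the test documents in time, a classifier evaluated this way is not credited for seeing the future. A random split that mixes periods would be, which is one plausible source of the performance overestimation the abstract mentions.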

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Šilić, A., Dalbelo Bašić, B. (2012). Exploring Classification Concept Drift on a Large News Text Corpus. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2012. Lecture Notes in Computer Science, vol 7181. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28604-9_35

  • DOI: https://doi.org/10.1007/978-3-642-28604-9_35

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-28603-2

  • Online ISBN: 978-3-642-28604-9

  • eBook Packages: Computer Science, Computer Science (R0)