The Problem of First Story Detection in Multiaspect Text Categorization

  • Conference paper
Information Technology and Computational Physics (CITCEP 2016)

Abstract

The concept of multiaspect text categorization (MTC), recently introduced in a series of our papers, may be viewed as a combination of classic text categorization (TC) and a form of sequential data classification. The first aspect of the problem, i.e., the assignment of a document to a category, may be addressed using one of the well-known techniques, e.g., the k-nearest neighbors method. The second aspect is less standard: it boils down to assigning a document to one of the sequences of documents, called cases, maintained within a category. Cases cannot be treated in the same way as categories because, first, each case is a set of documents ordered by time of arrival and, second, cases are usually represented in a training dataset by a relatively small number of documents. Moreover, it is assumed that new cases can emerge during the lifetime of the document collection. Hence, assigning a document to a case is a challenging task in itself, and deciding whether a document starts a new case is even more difficult. In this paper, we deal with the latter problem, discussing it in the broader perspective of sequential data mining and comparing a number of approaches to solving it.
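To make the decision described above concrete, the following is a minimal sketch of one generic way to flag a document that starts a new case: compare the incoming document with the documents already stored in each case and open a new case when the best similarity falls below a threshold. It is an illustration only, not one of the approaches compared in the paper. The bag-of-words representation, the cosine similarity, the threshold value of 0.3, and all function names (`vectorize`, `cosine`, `assign_to_case`, `process_stream`) are assumptions introduced here, and the sketch works within a single category.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term-frequency vector (a deliberately simple, assumed representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def assign_to_case(doc, cases, threshold=0.3):
    """Return the index of the most similar case, or None if doc starts a new case.

    A case is a list of documents in order of arrival. The similarity of a
    document to a case is taken as the maximum similarity to any document
    in that case (a nearest-neighbour rule; the threshold is hypothetical).
    """
    vec = vectorize(doc)
    best_idx, best_sim = None, 0.0
    for idx, case in enumerate(cases):
        sim = max((cosine(vec, vectorize(d)) for d in case), default=0.0)
        if sim > best_sim:
            best_idx, best_sim = idx, sim
    return best_idx if best_sim >= threshold else None


def process_stream(docs, threshold=0.3):
    """Process documents in arrival order within one category.

    Returns the resulting cases and the positions of documents that were
    flagged as first stories, i.e., as starting a new case.
    """
    cases, first_stories = [], []
    for i, doc in enumerate(docs):
        idx = assign_to_case(doc, cases, threshold)
        if idx is None:
            cases.append([doc])       # no case is similar enough: open a new one
            first_stories.append(i)
        else:
            cases[idx].append(doc)    # extend the best-matching case
    return cases, first_stories


if __name__ == "__main__":
    stream = [
        "fuzzy logic controller design",
        "fuzzy logic controller tuning",
        "protein folding simulation",
        "protein folding energy simulation",
    ]
    cases, first_stories = process_stream(stream)
    print(first_stories)
    print(cases)
```

A single fixed threshold is the simplest possible decision rule. Because cases are small and ordered in time, a practical system would likely need something more careful, which is the difficulty the abstract points to.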

Acknowledgements

This work is partially supported by the National Science Centre (contract no. UMO-2011/01/B/ST6/06908).

Author information

Corresponding author

Correspondence to Sławomir Zadrożny.

Copyright information

© 2017 Springer International Publishing Switzerland

About this paper

Cite this paper

Zadrożny, S., Kacprzyk, J., Gajewski, M. (2017). The Problem of First Story Detection in Multiaspect Text Categorization. In: Kulczycki, P., Kóczy, L., Mesiar, R., Kacprzyk, J. (eds) Information Technology and Computational Physics. CITCEP 2016. Advances in Intelligent Systems and Computing, vol 462. Springer, Cham. https://doi.org/10.1007/978-3-319-44260-0_1

  • DOI: https://doi.org/10.1007/978-3-319-44260-0_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-44259-4

  • Online ISBN: 978-3-319-44260-0

  • eBook Packages: Engineering, Engineering (R0)
