The Problem of First Story Detection in Multiaspect Text Categorization

Zadrożny, Sławomir; Kacprzyk, Janusz; Gajewski, Marek

doi:10.1007/978-3-319-44260-0_1

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 462)

Included in the following conference series:

Congress on Information Technology, Computational and Experimental Physics

Abstract

The concept of multiaspect text categorization (MTC), recently introduced in a series of our papers, may be viewed as a combination of classic text categorization (TC) and a form of sequential data classification. The first aspect of the problem, i.e., the assignment of a document to a category, may be addressed using one of the well-known techniques, e.g., the k-nearest neighbors method. The second aspect is less standard: it boils down to assigning a document to one of the sequences of documents, called cases, maintained within a category. Cases cannot be treated in the same way as categories because, first, each case is a set of documents ordered by time of arrival and, second, cases are usually represented in a training dataset by a relatively small number of documents. Moreover, it is assumed that new cases can emerge during the lifetime of the document collection. Hence, assigning a document to a case is a challenging task in itself, and deciding whether a document starts a new case is even more difficult. In this paper, we deal with the latter problem, discussing it in the broader perspective of sequential data mining and comparing a number of approaches to solving it.
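
To make the two-stage decision described in the abstract concrete, the following is a minimal sketch, not the method evaluated in the paper. It assumes a plain bag-of-words representation, cosine similarity, a k-nearest-neighbors vote for the category, and a simple similarity threshold for deciding that a document starts a new case. All names (`vectorize`, `knn_category`, `assign_case`), the value of `threshold`, and the toy `training` data are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words term-frequency vector (a deliberately simple assumption)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def knn_category(doc, training, k=3):
    """Stage 1: majority vote over the k training documents most similar to doc."""
    v = vectorize(doc)
    ranked = sorted(training, key=lambda r: cosine(v, vectorize(r[0])), reverse=True)
    votes = Counter(category for _, category, _ in ranked[:k])
    return votes.most_common(1)[0][0]

def assign_case(doc, training, category, threshold=0.3):
    """Stage 2: return the most similar case within the category,
    or None when no case is similar enough (the document starts a new case)."""
    v = vectorize(doc)
    best_case, best_sim = None, 0.0
    for text, cat, case_id in training:
        if cat != category:
            continue
        sim = cosine(v, vectorize(text))
        if sim > best_sim:
            best_case, best_sim = case_id, sim
    return best_case if best_sim >= threshold else None

# Hypothetical toy collection: (text, category, case identifier).
training = [
    ("flood river water damage", "weather", "flood-2016"),
    ("river flood rescue water", "weather", "flood-2016"),
    ("storm wind damage roof", "weather", "storm-2016"),
    ("election vote parliament", "politics", "election"),
    ("parliament vote budget", "politics", "budget"),
]

for doc in ["flood water rescue teams", "heatwave temperature record storm"]:
    category = knn_category(doc, training)
    case = assign_case(doc, training, category)
    print(doc, "->", category, "/", case if case is not None else "new case")
```

This sketch ignores the ordering of documents within a case, which the abstract singles out as a defining property of cases, and a fixed threshold is only one of many possible decision rules for detecting a first story.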

Acknowledgements

This work is partially supported by the National Science Centre (contract no. UMO-2011/01/B/ST6/06908).

Author information

Authors and Affiliations

Systems Research Institute, Polish Academy of Sciences, ul. Newelska 6, 01-447, Warszawa, Poland
Sławomir Zadrożny, Janusz Kacprzyk & Marek Gajewski

Corresponding author

Correspondence to Sławomir Zadrożny.

Editor information

Editors and Affiliations

Faculty of Physics & Applied Comp. Sci, AGH University of Science and Technology, Kraków, Poland
Piotr Kulczycki
Faculty of Engineering Sciences, Széchenyi István University, Győr, Hungary
László T. Kóczy
Faculty of Civil Engineering, Slovak University of Technology, Bratislava, Slovakia
Radko Mesiar
Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Janusz Kacprzyk

About this paper

Cite this paper

Zadrożny, S., Kacprzyk, J., Gajewski, M. (2017). The Problem of First Story Detection in Multiaspect Text Categorization. In: Kulczycki, P., Kóczy, L., Mesiar, R., Kacprzyk, J. (eds) Information Technology and Computational Physics. CITCEP 2016. Advances in Intelligent Systems and Computing, vol 462. Springer, Cham. https://doi.org/10.1007/978-3-319-44260-0_1

DOI: https://doi.org/10.1007/978-3-319-44260-0_1
Published: 01 June 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-44259-4
Online ISBN: 978-3-319-44260-0
eBook Packages: Engineering, Engineering (R0)
