Abstract
Document vector embeddings are fixed-length numeric representations of text documents that can be used for machine learning and text mining. In this paper we describe a new technique for generating document vectors. The idea builds on the recently popular notion of neural word vector embeddings and combines it with kernel density estimation. We show that our algorithm produces robust document vectors, and we demonstrate its effectiveness in an experiment on several challenging text classification datasets.
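The abstract does not spell out the algorithm, so the following is only an illustrative sketch of one way word embeddings and kernel density estimation could be combined, not the method of the paper. It assumes a Gaussian kernel, a fixed bandwidth, and a fixed evaluation grid shared by all documents; the function names, the toy embedding values, and the grid are all hypothetical. For each embedding dimension, the density of the document's word-vector values is estimated and sampled at the grid points, and the samples are concatenated into one fixed-length vector.

```python
import math


def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `samples` at each grid point."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((g - x) / bandwidth) ** 2) for x in samples)
        for g in grid
    ]


def kde_document_vector(word_vectors, grid, bandwidth=0.5):
    """Concatenate per-dimension density values into one fixed-length vector.

    Assumption: one independent 1-D density per embedding dimension.
    """
    if not word_vectors:
        raise ValueError("document has no word vectors")
    dims = len(word_vectors[0])
    doc_vector = []
    for d in range(dims):
        # Values taken by dimension d across all words in the document.
        column = [vec[d] for vec in word_vectors]
        doc_vector.extend(gaussian_kde_1d(column, grid, bandwidth))
    return doc_vector


# Toy 2-dimensional "embeddings" (made-up values, not from a trained model).
toy_embeddings = {
    "kernel": [0.9, -0.2],
    "density": [0.7, 0.1],
    "vector": [-0.4, 0.8],
}
grid = [-1.0, 0.0, 1.0]  # hypothetical evaluation points

short_doc = ["kernel", "density"]
long_doc = ["kernel", "density", "vector", "vector", "kernel"]

vec_short = kde_document_vector([toy_embeddings[w] for w in short_doc], grid)
vec_long = kde_document_vector([toy_embeddings[w] for w in long_doc], grid)

# Both documents map to vectors of length dims * len(grid), whatever their word count.
print(len(vec_short), len(vec_long))
```

In this sketch the output length depends only on the embedding dimensionality and the grid size, which is what lets documents of different lengths be fed to a standard classifier.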
Notes
- 1. Version from December 2016.
- 2. http://mattmahoney.net/dc/textdata.html as at December 2016.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Mayo, M., Goltz, S. (2017). Constructing Document Vectors Using Kernel Density Estimates. In: Torra, V., Narukawa, Y., Honda, A., Inoue, S. (eds) Modeling Decisions for Artificial Intelligence. MDAI 2017. Lecture Notes in Computer Science, vol 10571. Springer, Cham. https://doi.org/10.1007/978-3-319-67422-3_16
DOI: https://doi.org/10.1007/978-3-319-67422-3_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67421-6
Online ISBN: 978-3-319-67422-3
eBook Packages: Computer Science, Computer Science (R0)