Constructing Document Vectors Using Kernel Density Estimates

Mayo, Michael; Goltz, Sean

doi:10.1007/978-3-319-67422-3_16

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10571)

Included in the following conference series:

International Conference on Modeling Decisions for Artificial Intelligence

Abstract

Document vector embeddings are fixed-length numeric representations of text documents that can be used for machine learning and text mining. This paper describes a new technique for generating document vectors. The idea builds on the recently popular notion of neural word vector embeddings and combines it with kernel density estimation. We show that the new algorithm produces robust document vectors, and we demonstrate its effectiveness in an experiment on several challenging text classification datasets.
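
The abstract does not spell out the algorithm, so the following is only a minimal sketch of one way word embeddings and kernel density estimation could be combined into a fixed-length document vector. It assumes that each embedding dimension is treated separately, that a Gaussian kernel is used, and that the density is sampled on a fixed grid; the function name `kde_document_vector`, the `grid` and `bandwidth` parameters, and the random toy embeddings are all hypothetical and are not taken from the paper.

```python
import numpy as np


def kde_document_vector(word_vectors, grid, bandwidth=0.5):
    """Summarise a document's word vectors as sampled per-dimension densities.

    word_vectors: array of shape (n_words, dim), one embedding per word.
    grid: 1-D array of points at which each density is evaluated (assumed).
    bandwidth: Gaussian kernel width (an assumed, hand-picked value).
    Returns a fixed-length vector of shape (dim * len(grid),).
    """
    word_vectors = np.asarray(word_vectors, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Scaled differences between every grid point and every word value,
    # per dimension: shape (len(grid), n_words, dim).
    diffs = (grid[:, None, None] - word_vectors[None, :, :]) / bandwidth
    kernels = np.exp(-0.5 * diffs ** 2) / np.sqrt(2.0 * np.pi)
    # Average the kernels over words to get the density at each grid point.
    densities = kernels.mean(axis=1) / bandwidth  # shape (len(grid), dim)
    # Flatten so that each dimension's density samples are contiguous.
    return densities.T.ravel()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grid = np.linspace(-5.0, 5.0, 101)
    # Toy stand-ins for word embeddings: two documents of different lengths.
    short_doc = rng.uniform(-1.0, 1.0, size=(4, 3))
    long_doc = rng.uniform(-1.0, 1.0, size=(50, 3))
    print(kde_document_vector(short_doc, grid).shape)
    print(kde_document_vector(long_doc, grid).shape)
```

Under these assumptions, documents with different numbers of words map to vectors of the same length, which is the property a downstream classifier needs.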

Notes

1. Version from December 2016.
2. http://mattmahoney.net/dc/textdata.html as at December 2016.

References

Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. arXiv preprint arxiv:1607.04606 (2016)
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
Duong, T.: ks: kernel density estimation and kernel discriminant analysis for multivariate data in R. J. Stat. Soft. 21(7), 1–16 (2007)
Frank, E., Hall, M., Witten, I.: The WEKA workbench. In: Data Mining: Practical Machine Learning Tools and Techniques, 4th edn. Morgan Kaufmann (2016)
Goltz, S., Mayo, M.: Enhancing regulatory compliance by using artificial intelligence text mining to identify penalty clauses in legislation. In: Proceedings of Workshop on MIning and REasoning with Legal Texts (MIREL 2017) (2017, to appear)
Greene, D., Cunningham, P.: Practical solutions to the problem of diagonal dominance in kernel document clustering. In: Proceedings of 23rd International Conference on Machine Learning (ICML), pp. 377–384 (2006)
Harris, Z.: Distributional structure. Word 10(2–3), 146–162 (1954)
Hinneburg, A., Gabriel, H.-H.: DENCLUE 2.0: Fast clustering based on kernel density estimation. In: Berthold, M.R., Shawe-Taylor, J., Lavrač, N. (eds.) IDA 2007. LNCS, vol. 4723, pp. 70–80. Springer, Heidelberg (2007). doi:10.1007/978-3-540-74825-0_7
Iyyer, M., Enns, P., Boyd-Graber, J., Resnik, P.: Political ideology detection using recursive neural networks. In: Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 1113–1122 (2014)
Joulin, A., Grave, E., Bojanowski, P., Mikolov, T.: Bag of tricks for efficient text classification. arXiv preprint arxiv:1607.01759 (2016)
Keerthi, S., Shevade, S., Bhattacharyya, C., Murthy, K.: Improvements to Platt’s SMO algorithm for SVM classifier design. Neural Comput. 13(3), 637–649 (2001)
Lau, J., Baldwin, T.: An empirical evaluation of doc2vec with practical insights into document embedding generation, Technical report arxiv:1607.05368 (2016)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. Technical report, arXiv preprint arxiv:1301.3781 (2013)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems 26 (NIPS 2013) (2013)
Ott, M., Cardie, C., Hancock, J.: Negative deceptive opinion spam. In: Proceedings of 2013 Conference of the North American Chapter of the Association for Computational Linguistics (2013)
Ott, M., Choi, Y., Cardie, C., Hancock, J.: Finding deceptive opinion spam by any stretch of the imagination. In: Proceedings of 49th Annual Meeting of the Association for Computational Linguistics (2011)
Pang, B., Lee, L., Vaithyanathan, S.: Thumbs up? Sentiment classification using machine learning techniques. In: Proceedings of Empirical Methods on Natural Language Processing, pp. 79–86 (2002)
Silverman, B.: Density Estimation for Statistics and Data Analysis. Chapman & Hall/CRC, London (1986)
Simonoff, J.: Smoothing Methods in Statistics. Springer, New York (1996). doi:10.1007/978-1-4612-4026-6
Turney, P., Pantel, P.: From frequency to meaning: vector space models of semantics. J. Artif. Intell. Res. 37, 141–188 (2010)

Author information

Authors and Affiliations

Department of Computer Science, University of Waikato, Hamilton, New Zealand
Michael Mayo & Sean Goltz

Corresponding author

Correspondence to Michael Mayo.

Editor information

Editors and Affiliations

University of Skövde, Skövde, Sweden
Vicenç Torra
Toho Gakuen, Kunitachi, Tokyo, Japan
Yasuo Narukawa
Kyushu Institute of Technology, Iizuka, Fukuoka, Japan
Aoi Honda
Kyushu Institute of Technology, Kitakyushu-shi, Fukuoka, Japan
Sozo Inoue

Cite this paper

Mayo, M., Goltz, S. (2017). Constructing Document Vectors Using Kernel Density Estimates. In: Torra, V., Narukawa, Y., Honda, A., Inoue, S. (eds) Modeling Decisions for Artificial Intelligence. MDAI 2017. Lecture Notes in Computer Science, vol 10571. Springer, Cham. https://doi.org/10.1007/978-3-319-67422-3_16

DOI: https://doi.org/10.1007/978-3-319-67422-3_16
Published: 13 September 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-67421-6
Online ISBN: 978-3-319-67422-3
eBook Packages: Computer Science, Computer Science (R0)
