Constructing Document Vectors Using Kernel Density Estimates

  • Conference paper
Modeling Decisions for Artificial Intelligence (MDAI 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10571)

Abstract

Document vector embeddings are fixed-length numeric representations of text documents that can be used for machine learning and text mining. In this paper we describe a new technique for generating document vectors. The idea builds on the recently popular notion of neural word vector embeddings and combines it with kernel density estimation. We show that our algorithm produces robust document vectors, and we demonstrate its effectiveness in an experiment on several challenging text classification datasets.
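The abstract does not spell out the algorithm, so the following is only a minimal sketch of one way word vectors and kernel density estimates could be combined into a fixed-length document vector. It is not the authors' method. Everything specific here is an assumption made for illustration: the function names `gaussian_kde_1d` and `kde_document_vector`, the Gaussian kernel, the fixed bandwidth, the shared evaluation grid, and the random toy embeddings standing in for trained word vectors.

```python
import numpy as np


def gaussian_kde_1d(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `samples` at each `grid` point."""
    # Pairwise scaled differences between grid points (rows) and samples (columns).
    diffs = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * diffs ** 2) / np.sqrt(2.0 * np.pi)
    return kernel.sum(axis=1) / (samples.size * bandwidth)


def kde_document_vector(word_vectors, grid, bandwidth=0.25):
    """Hypothetical construction: one density estimate per embedding dimension.

    word_vectors has shape (n_words, dim). The result has length
    dim * len(grid), independent of how many words the document contains.
    """
    per_dimension = [
        gaussian_kde_1d(word_vectors[:, d], grid, bandwidth)
        for d in range(word_vectors.shape[1])
    ]
    return np.concatenate(per_dimension)


# Toy 3-dimensional embeddings (random placeholders, not trained vectors).
rng = np.random.default_rng(0)
toy_embeddings = {
    word: rng.normal(size=3)
    for word in ["kernel", "density", "document", "vector"]
}

doc = ["kernel", "density", "document", "kernel"]
vectors = np.stack([toy_embeddings[word] for word in doc])
grid = np.linspace(-3.0, 3.0, 5)  # assumed evaluation points, shared by all documents

doc_vec = kde_document_vector(vectors, grid)
print(doc_vec.shape)
```

Because every document is evaluated on the same grid, documents of different lengths map to vectors of the same size, which is the fixed-length property the abstract asks of a document vector. The bandwidth and grid would need tuning in practice.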

Notes

  1. Version from December 2016.

  2. http://mattmahoney.net/dc/textdata.html as at December 2016.

Author information

Corresponding author

Correspondence to Michael Mayo.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Mayo, M., Goltz, S. (2017). Constructing Document Vectors Using Kernel Density Estimates. In: Torra, V., Narukawa, Y., Honda, A., Inoue, S. (eds) Modeling Decisions for Artificial Intelligence. MDAI 2017. Lecture Notes in Computer Science, vol 10571. Springer, Cham. https://doi.org/10.1007/978-3-319-67422-3_16

  • DOI: https://doi.org/10.1007/978-3-319-67422-3_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-67421-6

  • Online ISBN: 978-3-319-67422-3

  • eBook Packages: Computer Science (R0)
