Comparing Two Models of Document Similarity Search over a Text Stream of Articles from Online News Sites

  • Conference paper
Intelligent Computing and Optimization (ICO 2019)

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1072)

Abstract

In this paper, we compare two models of document similarity search over a text stream of articles collected daily from online news sites. The first model uses word-to-vector embedding (Word2Vec), the neural-network-based document embedding known as document-to-vector (Doc2Vec), and the k-NN technique to perform similarity search in a tree structure called M-Tree. The second model applies a Gensim model to perform the same document similarity search. As the evaluation metric, we measure the accuracy of the documents returned as similar to a document d by checking whether they belong to the same category as d. We run experiments, analyze the results, and discuss and propose solutions for improvement. Our main contribution is the comparison of the two solutions for performing document similarity queries.
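
The category-based metric described in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation: a brute-force cosine-similarity scan stands in for the M-Tree index, the hand-written three-dimensional vectors stand in for Doc2Vec embeddings, and every name here (k_nearest, category_accuracy, the document ids and categories) is hypothetical. It only assumes that the metric is the fraction of retrieved neighbours sharing the query document's category.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def k_nearest(query_id, vectors, k):
    """Brute-force k-NN: ids of the k documents most similar to query_id.

    Assumption: a linear scan replaces the M-Tree used in the paper.
    """
    scores = [
        (cosine_similarity(vectors[query_id], vec), doc_id)
        for doc_id, vec in vectors.items()
        if doc_id != query_id
    ]
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scores[:k]]


def category_accuracy(query_id, neighbours, categories):
    """Fraction of neighbours in the same category as the query document."""
    if not neighbours:
        return 0.0
    same = sum(1 for n in neighbours if categories[n] == categories[query_id])
    return same / len(neighbours)


# Toy data (hypothetical): stand-ins for document embeddings and labels.
vectors = {
    "d1": [0.9, 0.1, 0.0],
    "d2": [0.8, 0.2, 0.1],
    "d3": [0.1, 0.9, 0.2],
    "d4": [0.7, 0.1, 0.3],
    "d5": [0.0, 0.8, 0.6],
}
categories = {
    "d1": "sports",
    "d2": "sports",
    "d3": "politics",
    "d4": "sports",
    "d5": "politics",
}

neighbours = k_nearest("d1", vectors, k=3)
accuracy = category_accuracy("d1", neighbours, categories)
print(neighbours, accuracy)
```

With this toy data, two of the three documents retrieved for d1 share its category, so the score is 2/3. Averaging the same score over many query documents would give one number per model, which is one plausible way to compare the two approaches.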

Acknowledgments

This research is funded by Thu Dau Mot University, Binh Duong, and Vietnam National University Ho Chi Minh City (VNU-HCMC) under grant number B2017-26-02.

Author information

Corresponding author

Correspondence to Tham Vo Thi Hong.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Hong, T.V.T., Do, P. (2020). Comparing Two Models of Document Similarity Search over a Text Stream of Articles from Online News Sites. In: Vasant, P., Zelinka, I., Weber, GW. (eds) Intelligent Computing and Optimization. ICO 2019. Advances in Intelligent Systems and Computing, vol 1072. Springer, Cham. https://doi.org/10.1007/978-3-030-33585-4_38
