Abstract
Finding phrase-based similarity among documents requires analyzing the text stored in each document before applying any machine learning algorithm. Because raw text is difficult to analyze directly, it must first be broken into words or phrases, or converted to a numerical measure. For the numerical conversion, the well-known bag-of-words model with term frequency or the TF-IDF model can be used. The resulting words, phrases, or numerical data are then stored in a structure such as a vector, tree, or graph, known as a document representation model. This paper shows how different document representation models store words, phrases, or converted numerical data for finding phrase-based similarity. Phrase-based similarity methods make use of word proximity, so they can be used to find syntactic similarities between documents in a corpus. The similarity is calculated from the frequency of words or of phrases in sentences. The paper analyzes and compares the representation models on several parameters with respect to finding phrase-based similarity.
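To make the contrast between word frequency and phrase frequency concrete, the following is a minimal sketch and not the method of the paper. It assumes a simple vector representation, uses contiguous two-word phrases (bigrams) as a stand-in for "phrases", and uses cosine similarity over raw counts as the similarity measure. All function names are hypothetical and chosen for illustration only.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,;:!?()\"'") for w in text.lower().split())
    return [w for w in words if w]


def phrases(tokens, n=2):
    """Return contiguous n-word phrases (word n-grams), preserving word order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def cosine(a, b):
    """Cosine similarity between two frequency Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def word_similarity(doc1, doc2):
    """Bag-of-words similarity: term frequencies only, word order ignored."""
    return cosine(Counter(tokenize(doc1)), Counter(tokenize(doc2)))


def phrase_similarity(doc1, doc2, n=2):
    """Phrase-based similarity (assumed here: cosine over n-gram counts)."""
    return cosine(Counter(phrases(tokenize(doc1), n)),
                  Counter(phrases(tokenize(doc2), n)))


d1 = "the dog bit the man"
d2 = "the man bit the dog"

# Same words with the same frequencies, so the bag-of-words score is maximal.
print(word_similarity(d1, d2))
# Only three of the four bigrams are shared, so the phrase score is lower.
print(phrase_similarity(d1, d2))
```

In this toy pair the two sentences are indistinguishable under bag-of-words, while the bigram counts differ because word order differs. That is the sense in which phrase-based methods use word proximity. Tree or graph representations would store the same phrases in a different structure.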
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Kathiria, P., Arolkar, H. (2019). Study of Different Document Representation Models for Finding Phrase-Based Similarity. In: Satapathy, S., Joshi, A. (eds) Information and Communication Technology for Intelligent Systems. Smart Innovation, Systems and Technologies, vol 106. Springer, Singapore. https://doi.org/10.1007/978-981-13-1742-2_45
DOI: https://doi.org/10.1007/978-981-13-1742-2_45
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-1741-5
Online ISBN: 978-981-13-1742-2
eBook Packages: Intelligent Technologies and Robotics (R0)