Study of Different Document Representation Models for Finding Phrase-Based Similarity

Kathiria, Preeti; Arolkar, Harshal

doi:10.1007/978-981-13-1742-2_45

Conference paper
First Online: 30 December 2018

Part of the book series: Smart Innovation, Systems and Technologies (SIST, volume 106)

Abstract

To find phrase-based similarity among documents, the text stored within each document must first be analyzed before any machine learning algorithm is applied. Because textual data is difficult to analyze directly, the text needs to be broken into words or phrases, or converted to a numerical measure. To convert text data into a numerical measure, the well-known bag-of-words model with term frequency or the TF-IDF model can be used. The converted numerical data, or the extracted words or phrases, are then stored in a structure such as a vector, tree, or graph, known as a document representation model. The focus of this paper is to show how different document representation models can store words, phrases, or converted numerical data to find phrase-based similarity. Phrase-based similarity methods make use of word proximity, so they can be used to find syntactic similarities between documents in a corpus. The similarity is calculated from the frequency of words or the frequency of phrases in sentences. This paper analyzes and compares different representation models on several parameters for finding phrase-based similarity.
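The contrast between word-frequency and phrase-frequency similarity can be illustrated with a small sketch. This is not the authors' method: it assumes a plain term-frequency vector (a Python `Counter`), treats contiguous word bigrams as "phrases", and uses cosine similarity as the comparison measure. The function names `tokenize`, `ngrams`, `tf_vector`, and `cosine` are hypothetical helpers chosen for illustration.

```python
import math
from collections import Counter

PUNCT = ".,;:!?()\"'"

def tokenize(text):
    """Lowercase, split on whitespace, and strip basic punctuation."""
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]

def ngrams(tokens, n):
    """Contiguous word n-grams; n=1 yields ordinary bag-of-words terms."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_vector(text, n=1):
    """Term-frequency vector keyed by n-gram (assumed vector representation)."""
    return Counter(ngrams(tokenize(text), n))

def cosine(a, b):
    """Cosine similarity between two sparse frequency vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

doc_a = "the dog chased the cat"
doc_b = "the cat chased the dog"

# Same words, different order: word-level vectors are identical,
# while bigram (phrase-level) vectors differ because proximity changes.
word_sim = cosine(tf_vector(doc_a, 1), tf_vector(doc_b, 1))
phrase_sim = cosine(tf_vector(doc_a, 2), tf_vector(doc_b, 2))
print(f"word-level similarity:   {word_sim:.2f}")
print(f"phrase-level similarity: {phrase_sim:.2f}")
```

In this toy pair the two documents contain exactly the same words, so the word-level score is effectively 1, while the phrase-level score is lower because only three of the four bigrams are shared. A tree or graph representation, as discussed in the paper, would store the same phrase information in a different structure.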

Author information

Authors and Affiliations

Nirma University, Ahmedabad, India
Preeti Kathiria
GLS University, Ahmedabad, India
Harshal Arolkar

Corresponding author

Correspondence to Preeti Kathiria.

Editor information

Editors and Affiliations

School of Computer Engineering, Kalinga Institute of Industrial Technology, Bhubaneswar, India
Suresh Chandra Satapathy
Sabar Institute of Technology, Gujarat Technological University, Ahmedabad, Gujarat, India
Amit Joshi

About this paper

Cite this paper

Kathiria, P., Arolkar, H. (2019). Study of Different Document Representation Models for Finding Phrase-Based Similarity. In: Satapathy, S., Joshi, A. (eds) Information and Communication Technology for Intelligent Systems. Smart Innovation, Systems and Technologies, vol 106. Springer, Singapore. https://doi.org/10.1007/978-981-13-1742-2_45

DOI: https://doi.org/10.1007/978-981-13-1742-2_45
Published: 30 December 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-1741-5
Online ISBN: 978-981-13-1742-2
eBook Packages: Intelligent Technologies and Robotics (R0)
