Study of Different Document Representation Models for Finding Phrase-Based Similarity

  • Preeti Kathiria
  • Harshal Arolkar
Conference paper
Part of the Smart Innovation, Systems and Technologies book series (SIST, volume 106)


To find phrase-based similarity among documents, the text stored in each document must first be analyzed before any machine learning algorithm is applied. Because raw text is difficult to analyze directly, it needs to be broken into words or phrases, or converted to a numerical measure. To convert text into a numerical measure, the well-known bag-of-words model with term frequency (TF) or TF-IDF weighting can be used. The resulting numerical data, words, or phrases are then stored in a structure such as a vector, tree, or graph, known as a document representation model. The focus of this paper is to show how different document representation models can store words, phrases, or converted numerical data for finding phrase-based similarity. Phrase-based similarity methods make use of word proximity, so they can be used to find syntactic similarities between documents in a corpus. The similarity is calculated from the frequency of words or the frequency of phrases in sentences. This paper analyzes and compares different representation models on several parameters for finding phrase-based similarity.
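The contrast between a bag-of-words vector and a phrase-based comparison can be illustrated with a minimal Python sketch. It is not the method of any model compared in the paper: the helper names (`tokenize`, `tf_vector`, `cosine`, `phrases`, `phrase_overlap`) are hypothetical, phrases are approximated by word bigrams, and the Jaccard overlap is an assumed stand-in for a phrase-based similarity measure. A TF-IDF model would rescale the same term-frequency vectors by inverse document frequency.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [w for w in words if w]


def tf_vector(text):
    """Bag-of-words representation: term -> raw term frequency."""
    return Counter(tokenize(text))


def cosine(u, v):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(w * v.get(t, 0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def phrases(text, n=2):
    """Set of word n-grams, so word order (proximity) is preserved."""
    tokens = tokenize(text)
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def phrase_overlap(a, b, n=2):
    """Jaccard overlap of n-gram sets (illustrative assumption only)."""
    pa, pb = phrases(a, n), phrases(b, n)
    if not pa and not pb:
        return 0.0
    return len(pa & pb) / len(pa | pb)


doc_a = "river rafting is fun"
doc_b = "fun is rafting river"    # same words as doc_a, reversed order
doc_c = "river rafting is risky"  # shares two bigrams with doc_a

print(cosine(tf_vector(doc_a), tf_vector(doc_b)), phrase_overlap(doc_a, doc_b))
print(cosine(tf_vector(doc_a), tf_vector(doc_c)), phrase_overlap(doc_a, doc_c))
```

In this toy case the bag-of-words vectors of `doc_a` and `doc_b` are identical even though the two share no bigram, which is the kind of word-order information a phrase-based representation is meant to retain.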


Keywords: Document representation model; Phrase-based document similarity; Document index graph model



Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  1. Nirma University, Ahmedabad, India
  2. GLS University, Ahmedabad, India
