Paraphrase Identification in Telugu Using Machine Learning
Paraphrase identification is the task of determining whether two sentences convey the same or similar meaning. We use count-based text representations, namely the term-document matrix and the term frequency-inverse document frequency (TF-IDF) matrix, together with the distributional representation methods of singular value decomposition (SVD) and non-negative matrix factorization (NMF), which are applied iteratively with different word-share and minimum document frequency values. From these representations the system learns features, which are then used to measure phrase-wise similarity between two sentences. The features are passed to several machine learning classification algorithms, and cross-validation accuracy is obtained. The corpus for this task was created manually from different news domains. Because a parser was unavailable, only a subset of the collected corpus has been used for this task. This is a first attempt at paraphrase identification in the Telugu language using this approach.
Keywords: Paraphrase identification, Count-based methods, Distributional representation methods, Corpus, Classification algorithms
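The count-based part of the approach described in the abstract can be illustrated with a minimal sketch. It builds TF-IDF vectors from term counts, drops terms below a minimum document frequency, and scores a sentence pair by cosine similarity. Everything specific here is an assumption made for illustration and is not taken from the paper: whitespace tokenization, the smoothed IDF formula, cosine similarity as the pair feature, and the function names. The SVD and NMF steps, the word-share values, and the classifiers are left out.

```python
import math
from collections import Counter


def term_document_counts(docs):
    """Return one Counter of whitespace-token counts per document."""
    return [Counter(doc.split()) for doc in docs]


def tfidf_vectors(docs, min_df=1):
    """Build sparse TF-IDF vectors (dicts), keeping terms with document frequency >= min_df."""
    counts = term_document_counts(docs)
    n_docs = len(docs)
    # Iterating a Counter yields each distinct term once, so this counts documents, not tokens.
    df = Counter(term for c in counts for term in c)
    vocab = {t for t, d in df.items() if d >= min_df}
    # Smoothed IDF: an assumed weighting, since the abstract does not state the exact formula.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    return [{t: c[t] * idf[t] for t in c if t in vocab} for c in counts]


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either vector is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pair_similarity(sentence_a, sentence_b, corpus=(), min_df=1):
    """Hypothetical pair feature: TF-IDF cosine similarity of two sentences."""
    vectors = tfidf_vectors(list(corpus) + [sentence_a, sentence_b], min_df=min_df)
    return cosine(vectors[-2], vectors[-1])


if __name__ == "__main__":
    # Toy placeholder sentences, not drawn from the paper's corpus.
    print(pair_similarity("the minister visited the city", "the minister toured the city"))
```

In a fuller pipeline, a score like this would be one of several features per sentence pair handed to a classifier and evaluated with cross-validation. Raising `min_df` shrinks the vocabulary, which is one plausible reading of why the abstract varies that value.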