Anuj@DPIL-FIRE2016: A Novel Paraphrase Detection Method in Hindi Language Using Machine Learning

Saini, Anuj; Verma, Aayushi

doi:10.1007/978-3-319-73606-8_11

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10478)

Included in the following conference series:

Forum for Information Retrieval Evaluation

Abstract

Every language admits several plausible interpretations. With the evolution of the web, smart devices and social media, identifying these syntactic and semantic ambiguities has become a challenging task. In Natural Language Processing, two statements that use different words but carry the same meaning are termed paraphrases. At FIRE 2016, we worked on the Shared Task DPIL (Detecting Paraphrases in Indian Languages), specifically for the Hindi language. This paper proposes a novel approach to identify whether two statements are paraphrases, using machine learning algorithms such as Random Forest, Support Vector Machine, Gradient Boosting and Gaussian Naïve Bayes on the training data sets provided for the two subtasks. In cross-validation experiments, Random Forest outperforms the other methods with an F1-score of 0.94. We extended our work by adding a few more features and reusing the best classifier, which improved the F1-score by 1%. The experimental results show that our algorithm achieved the highest F1-score and accuracy and hence secured the first rank for Hindi among all participants in this shared task. Our approach can be used in applications such as question-answering systems, document clustering, machine translation, text summarization and plagiarism detection, among others.
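To make the evaluation metric concrete, the following minimal sketch computes an F1-score for binary paraphrase decisions. The abstract does not list the features or classifier settings the authors used, so everything here is an assumption for illustration: `jaccard_overlap` is a simple token-overlap feature, `predict` is a fixed-threshold rule standing in for a trained classifier, and the Hindi sentence pairs are made-up toy data, not drawn from the DPIL corpus.

```python
from typing import List, Sequence, Tuple


def jaccard_overlap(sentence_a: str, sentence_b: str) -> float:
    """Fraction of distinct whitespace tokens shared by the two sentences.

    Hypothetical feature for illustration; not taken from the paper.
    """
    tokens_a, tokens_b = set(sentence_a.split()), set(sentence_b.split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return len(tokens_a & tokens_b) / len(union)


def predict(pairs: Sequence[Tuple[str, str]], threshold: float = 0.5) -> List[int]:
    """Label a pair 1 (paraphrase) when its overlap reaches the threshold.

    A stand-in for a trained classifier such as Random Forest.
    """
    return [1 if jaccard_overlap(a, b) >= threshold else 0 for a, b in pairs]


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up toy pairs with gold labels (1 = paraphrase, 0 = not a paraphrase).
    toy_data = [
        ("वह स्कूल जाता है", "वह विद्यालय जाता है", 1),
        ("आज मौसम अच्छा है", "आज मौसम बहुत अच्छा है", 1),
        ("मुझे चाय पसंद है", "कल बारिश होगी", 0),
        ("वह किताब पढ़ता है", "वह खाना खाता है", 0),
    ]
    pairs = [(a, b) for a, b, _ in toy_data]
    gold = [label for _, _, label in toy_data]
    print("F1 on toy data:", f1_score(gold, predict(pairs)))
```

In a fuller pipeline, the threshold rule would presumably be replaced by a classifier trained on several such features and scored under cross-validation, as the abstract describes.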

Acknowledgments

We would like to thank the organizers of FIRE 2016 for conducting the shared task on Detecting Paraphrases in Indian Languages (DPIL) and for building the paraphrase corpora. We would also like to thank Sapient Corporation and Hays Business Solutions for giving us the opportunity to work in and explore the world of text analytics.

Author information

Authors and Affiliations

Sapient Global Markets, Gurugram, India
Anuj Saini
Hays Business Solutions, Noida, India
Aayushi Verma

Corresponding author

Correspondence to Anuj Saini.

Editor information

Editors and Affiliations

DAIICT, Gujarat, India
Prasenjit Majumder
Indian Statistical Institute, Kolkata, India
Mandar Mitra
DAIICT, Gujarat, India
Parth Mehta
DAIICT, Gujarat, India
Jainisha Sankhavara

About this paper

Cite this paper

Saini, A., Verma, A. (2018). Anuj@DPIL-FIRE2016: A Novel Paraphrase Detection Method in Hindi Language Using Machine Learning. In: Majumder, P., Mitra, M., Mehta, P., Sankhavara, J. (eds) Text Processing. FIRE 2016. Lecture Notes in Computer Science, vol 10478. Springer, Cham. https://doi.org/10.1007/978-3-319-73606-8_11

DOI: https://doi.org/10.1007/978-3-319-73606-8_11
Published: 04 January 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73605-1
Online ISBN: 978-3-319-73606-8
eBook Packages: Computer Science (R0)