Anuj@DPIL-FIRE2016: A Novel Paraphrase Detection Method in Hindi Language Using Machine Learning

  • Conference paper
Text Processing (FIRE 2016)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10478)

Abstract

Every language admits several plausible interpretations of the same text. With the growth of the web, smart devices and social media, identifying these syntactic and semantic ambiguities has become a challenging task. In Natural Language Processing, two statements that use different words but carry the same meaning are called paraphrases. At FIRE 2016, we worked on the Shared Task DPIL (Detecting Paraphrases in Indian Languages), specifically for the Hindi language. This paper proposes a novel approach to identify whether two statements are paraphrases, using machine learning algorithms such as Random Forest, Support Vector Machine, Gradient Boosting and Gaussian Naïve Bayes on the training data sets provided for the two subtasks. In cross-validation experiments, Random Forest outperforms the other methods with an F1-score of 0.94. We extended our work by adding a few more features and reusing the best classifier, which improved the F1-score by 1%. The experimental results show that our approach achieved the highest F1-score and accuracy, and hence secured the first rank for Hindi in this shared task among all participants. Our approach can be used in applications such as question answering, document clustering, machine translation, text summarization and plagiarism detection.
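
The abstract ranks classifiers by F1-score, so a small worked computation may help readers interpret that number. The sketch below is illustrative only: the function name `f1_score_binary` and the labels "P" (paraphrase) and "NP" (not a paraphrase) are assumptions made for this example, and the abstract does not say how the reported F1-score was averaged across classes or subtasks.

```python
def f1_score_binary(y_true, y_pred, positive="P"):
    """Return the F1-score of the `positive` class for two label sequences.

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    pairs = list(zip(y_true, y_pred))
    # Count true positives, false positives and false negatives.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # Precision and recall are both zero (or undefined); report 0.0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Made-up labels for five sentence pairs (assumed data, not from the corpus).
gold = ["P", "P", "NP", "NP", "P"]
predicted = ["P", "NP", "NP", "P", "P"]
score = f1_score_binary(gold, predicted)
```

Here two of the three gold paraphrases are found and one non-paraphrase is mislabelled, so precision and recall are both 2/3 and the F1-score is also 2/3.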

Acknowledgments

We would like to thank the organizers of FIRE 2016 for conducting this shared task on Detecting Paraphrases in Indian Languages (DPIL) and for building the paraphrase corpora. We would also like to thank Sapient Corporation and Hays Business Solutions for giving us the opportunity to work in and explore the world of text analytics.

Author information

Correspondence to Anuj Saini.

Copyright information

© 2018 Springer International Publishing AG

About this paper

Cite this paper

Saini, A., Verma, A. (2018). Anuj@DPIL-FIRE2016: A Novel Paraphrase Detection Method in Hindi Language Using Machine Learning. In: Majumder, P., Mitra, M., Mehta, P., Sankhavara, J. (eds) Text Processing. FIRE 2016. Lecture Notes in Computer Science, vol 10478. Springer, Cham. https://doi.org/10.1007/978-3-319-73606-8_11

  • DOI: https://doi.org/10.1007/978-3-319-73606-8_11

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-73605-1

  • Online ISBN: 978-3-319-73606-8

  • eBook Packages: Computer Science (R0)
