Detection of Duplicates in Quora and Twitter Corpus

Viswanathan, Sujith; Damodaran, Nikhil; Simon, Anson; George, Anon; Anand Kumar, M.; Soman, K. P.

doi:10.1007/978-981-13-1882-5_45


Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 750)

Abstract

Duplicate detection in a corpus of sentence pairs is the task of deciding whether the two sentences in each pair convey the same meaning. It supports deduplication, the process of removing duplicates. Traditional natural language processing techniques are less accurate at identifying similarity between sentences; such similar sentences are also referred to as paraphrases. Using the Quora and Twitter paraphrase corpora, we explored various approaches, including several machine learning algorithms, to obtain a reliable approach that can identify duplicates given a pair of sentences. This paper discusses the performance of six supervised machine learning algorithms on the two paraphrase corpora and analyzes how accurately they classify the sentence pairs in each corpus as duplicates or non-duplicates.
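To make the task concrete, the following is a minimal sketch of supervised duplicate classification over sentence pairs. The abstract does not name the six algorithms or the features used, so everything here is an assumption for illustration: a single word-overlap (Jaccard) feature, a one-threshold classifier fitted on labelled pairs, and invented toy data. The helper names `tokens`, `jaccard`, `fit_threshold`, and `predict` are hypothetical and are not taken from the paper.

```python
import re


def tokens(sentence):
    """Return the set of lowercase word tokens in a sentence."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))


def jaccard(a, b):
    """Word-overlap similarity between two sentences, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def fit_threshold(pairs, labels):
    """Choose the similarity threshold with the best training accuracy.

    This is a stand-in for a real supervised learner, not the
    method evaluated in the paper.
    """
    scores = [jaccard(a, b) for a, b in pairs]
    best_t, best_acc = 0.0, -1.0
    for t in sorted(set(scores)):
        preds = [int(s >= t) for s in scores]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc


def predict(pair, threshold):
    """Label a pair 1 (duplicate) or 0 (non-duplicate)."""
    return int(jaccard(*pair) >= threshold)


# Invented toy pairs; label 1 marks a duplicate, 0 a non-duplicate.
train_pairs = [
    ("How do I learn Python?", "How can I learn Python?"),
    ("What is the capital of France?", "What is the capital city of France?"),
    ("How do I learn Python?", "What is the capital of France?"),
    ("Who wrote Hamlet?", "How tall is Mount Everest?"),
]
train_labels = [1, 1, 0, 0]

threshold, train_acc = fit_threshold(train_pairs, train_labels)
```

Calling `predict(("Who wrote Hamlet?", "What is the capital of France?"), threshold)` then labels an unseen pair. Pure word overlap misses paraphrases that share few words, which is one reason learned models over richer features are usually preferred for this task.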

Author information

Authors and Affiliations

Center for Computational Engineering and Networking (CEN), Amrita Vishwa Vidyapeetham, Amrita School of Engineering, Coimbatore, India
Sujith Viswanathan, Nikhil Damodaran, Anson Simon, Anon George, M. Anand Kumar & K. P. Soman

Corresponding author

Correspondence to Sujith Viswanathan.

Editor information

Editors and Affiliations

Department of Computer Sciences Technology, Karunya Institute of Technology & Sciences, Coimbatore, Tamil Nadu, India
J. Dinesh Peter
Department of Civil and Environmental Engineering, University of Missouri, Columbia, MO, USA
Amir H. Alavi
School of Computing, Engineering and Mathematics, University of Western Sydney, Sydney, NSW, Australia
Bahman Javadi

About this paper

Cite this paper

Viswanathan, S., Damodaran, N., Simon, A., George, A., Anand Kumar, M., Soman, K.P. (2019). Detection of Duplicates in Quora and Twitter Corpus. In: Peter, J., Alavi, A., Javadi, B. (eds) Advances in Big Data and Cloud Computing. Advances in Intelligent Systems and Computing, vol 750. Springer, Singapore. https://doi.org/10.1007/978-981-13-1882-5_45

DOI: https://doi.org/10.1007/978-981-13-1882-5_45
Published: 12 December 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-1881-8
Online ISBN: 978-981-13-1882-5
eBook Packages: Intelligent Technologies and Robotics (R0)
