Anatomy of Preprocessing of Big Data for Monolingual Corpora Paraphrase Extraction: Source Language Sentence Selection

Kumar, Vivek; Verma, Abhishek; Mittal, Namita; Gromov, Sergey V.

doi:10.1007/978-981-13-1501-5_43

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 814)

Abstract

Within our work on cross-language information retrieval, the ultimate goal is to develop an intelligent model using non-corresponding corpora of Hindi and English. The work consists of two phases: Source Language Sentence Extraction (SLSE) and model building for translation. SLSE provides the training data for the model and accounts for 70% of the entire work. In this paper, we propose a novel pipeline for SLSE that first creates a bilingual dictionary, N-grams, an inverse term document index, etc. Because SLSE serves as the training data, it plays a crucial role in building the model, so particular attention has been paid to ensuring content richness. In this work, two non-corresponding English and Hindi corpora ranging from 60 GB of text have been constructed. Collecting data at this scale is tedious because the data is highly unstructured, and the processing time for big data is also substantial. To reduce the processing time, Hadoop was used throughout.
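
The abstract names N-grams and a term-document index as building blocks but gives no implementation details. As a rough illustration only, the sketch below shows what these two structures might look like on a toy corpus. It is not the authors' pipeline: it runs on a single machine rather than Hadoop, it reads the "inverse term document index" as a conventional inverted index (an assumption), and the function names, the whitespace tokenizer, and the sample sentences are all hypothetical.

```python
from collections import defaultdict


def tokenize(sentence):
    """Lowercase and split on whitespace (a stand-in for a real tokenizer)."""
    return sentence.lower().split()


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_inverted_index(docs, n=1):
    """Map each n-gram to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in ngrams(tokenize(text), n):
            index[gram].add(doc_id)
    return {gram: sorted(ids) for gram, ids in index.items()}


# Hypothetical toy corpus; the real corpora are far larger and unstructured.
docs = {
    "d1": "the river bank was flooded",
    "d2": "the bank approved the loan",
}

unigram_index = build_inverted_index(docs, n=1)
bigram_index = build_inverted_index(docs, n=2)
```

Looking up `unigram_index[("bank",)]` returns every document that mentions the term, while `bigram_index[("the", "bank")]` narrows the match to documents containing that exact two-word sequence. In a MapReduce setting, the inner loop would plausibly become the map step (emit n-gram and document id) and the set union the reduce step, but that mapping is an assumption rather than something the abstract states.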

Author information

Authors and Affiliations

National University of Science and Technology-MISiS, Moscow, 119049, Russian Federation
Vivek Kumar & Sergey V. Gromov
Malaviya National Institute of Technology, Jaipur, 302017, India
Abhishek Verma & Namita Mittal

Corresponding author

Correspondence to Abhishek Verma.

Editor information

Editors and Affiliations

Machine Intelligence Research Labs, Auburn, WA, USA
Ajith Abraham
Department of Computer and Systems Sciences, Visva-Bharati University, Santiniketan, West Bengal, India
Paramartha Dutta
Department of Computer Science and Engineering, University of Kalyani, Kalyani, India
Jyotsna Kumar Mandal
Institute of Engineering and Management, Kolkata, West Bengal, India
Abhishek Bhattacharya
Institute of Engineering and Management, Kolkata, West Bengal, India
Soumi Dutta

About this paper

Cite this paper

Kumar, V., Verma, A., Mittal, N., Gromov, S.V. (2019). Anatomy of Preprocessing of Big Data for Monolingual Corpora Paraphrase Extraction: Source Language Sentence Selection. In: Abraham, A., Dutta, P., Mandal, J., Bhattacharya, A., Dutta, S. (eds) Emerging Technologies in Data Mining and Information Security. Advances in Intelligent Systems and Computing, vol 814. Springer, Singapore. https://doi.org/10.1007/978-981-13-1501-5_43

DOI: https://doi.org/10.1007/978-981-13-1501-5_43
Published: 02 September 2018
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-1500-8
Online ISBN: 978-981-13-1501-5
eBook Packages: Intelligent Technologies and Robotics (R0)
