Abstract
In our work on cross-language information retrieval, the ultimate goal is to develop an intelligent model using non-corresponding corpora in Hindi and English. The work consists of two phases: Source Language Sentence Extraction (SLSE) and model building for translation. SLSE produces the training data for the model and accounts for 70% of the entire work. In this paper, we propose a novel pipeline for SLSE that first creates a bilingual dictionary, N-grams, an inverse term-document index, and related resources. Because the output of SLSE serves as training data, it plays a crucial role in building the model, so particular attention has been paid to ensuring content richness. For this work, two non-corresponding English and Hindi corpora on the scale of 60 GB of text have been constructed. Collecting data at this scale is tedious because it is highly unstructured, and the processing time for data of this size is substantial. To reduce the processing time, Hadoop was used throughout the pipeline.
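To make two of the resources named above concrete, the following is a minimal single-machine sketch of word N-gram generation and an inverse (inverted) term-document index. It is an illustration under stated assumptions, not the pipeline described in the paper: the function names `word_ngrams` and `build_inverted_index`, the whitespace tokenization, the lowercasing, and the toy documents are all hypothetical, and the Hadoop-based processing is not reproduced here.

```python
from collections import defaultdict


def word_ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_inverted_index(docs):
    """Map each lowercased term to the sorted ids of documents containing it.

    Assumption: terms are obtained by simple whitespace splitting.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical toy documents, used only for illustration.
docs = {
    "d1": "the river bank was flooded",
    "d2": "the bank approved the loan",
}

index = build_inverted_index(docs)
bigrams = word_ngrams(docs["d1"].split(), 2)

print(index["bank"])   # ids of the documents that mention "bank"
print(bigrams[:2])     # first two bigrams of document d1
```

In a distributed setting, the same index could plausibly be built by emitting (term, document id) pairs in a map step and grouping them by term in a reduce step, but that mapping is an assumption rather than something the abstract states.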
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Kumar, V., Verma, A., Mittal, N., Gromov, S.V. (2019). Anatomy of Preprocessing of Big Data for Monolingual Corpora Paraphrase Extraction: Source Language Sentence Selection. In: Abraham, A., Dutta, P., Mandal, J., Bhattacharya, A., Dutta, S. (eds) Emerging Technologies in Data Mining and Information Security. Advances in Intelligent Systems and Computing, vol 814. Springer, Singapore. https://doi.org/10.1007/978-981-13-1501-5_43
DOI: https://doi.org/10.1007/978-981-13-1501-5_43
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-1500-8
Online ISBN: 978-981-13-1501-5
eBook Packages: Intelligent Technologies and Robotics (R0)