Anatomy of Preprocessing of Big Data for Monolingual Corpora Paraphrase Extraction: Source Language Sentence Selection

  • Conference paper
Emerging Technologies in Data Mining and Information Security

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 814)

Abstract

In our work on cross-language information retrieval, the ultimate goal is to develop an intelligent model using non-corresponding corpora in Hindi and English. The work consists of two phases: Source Language Sentence Extraction (SLSE) and model building for translation. SLSE produces the training data for the model and accounts for 70% of the entire work. In this paper, we propose a novel pipeline for SLSE that first creates a bilingual dictionary, N-grams, an inverse term document index, and related resources. Because the output of SLSE serves as training data, it plays a crucial role in building the model, so particular attention has been paid to ensuring the richness of its content. For this work, two non-corresponding English and Hindi corpora, on the order of 60 GB of text, have been constructed. Collecting data at this scale is tedious because it is highly unstructured, and the processing time for data of this size is substantial. To reduce the processing time, Hadoop was used throughout.
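As a rough illustration of two of the resources named in the abstract, the sketch below builds word N-grams and a simple term-to-sentence index in plain Python. It is not the authors' pipeline: the function names `word_ngrams` and `build_inverted_index`, the whitespace tokenization, and the toy sentences are all assumptions made for the example, and the paper's Hadoop-based processing is not reproduced here.

```python
from collections import defaultdict


def word_ngrams(tokens, n):
    """Return the contiguous word n-grams (as tuples) found in tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_inverted_index(sentences):
    """Map each lowercased term to the sorted ids of sentences containing it."""
    index = defaultdict(set)
    for sent_id, sentence in enumerate(sentences):
        # Assumption: whitespace splitting stands in for a real tokenizer.
        for term in sentence.lower().split():
            index[term].add(sent_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical toy corpus, used only to exercise the two helpers.
sentences = ["the river is deep", "the deep river flows"]
bigrams = word_ngrams(sentences[0].split(), 2)
index = build_inverted_index(sentences)
```

On a corpus of the size described, the same two steps would typically be distributed, with the per-sentence loop playing the role of a map step and the per-term grouping the role of a reduce step.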

Author information

Corresponding author

Correspondence to Abhishek Verma.

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Kumar, V., Verma, A., Mittal, N., Gromov, S.V. (2019). Anatomy of Preprocessing of Big Data for Monolingual Corpora Paraphrase Extraction: Source Language Sentence Selection. In: Abraham, A., Dutta, P., Mandal, J., Bhattacharya, A., Dutta, S. (eds) Emerging Technologies in Data Mining and Information Security. Advances in Intelligent Systems and Computing, vol 814. Springer, Singapore. https://doi.org/10.1007/978-981-13-1501-5_43
