Knowledge-lean Paraphrase Identification Using Character-Based Features

Eyecioglu, Asli; Keller, Bill

doi:10.1007/978-3-319-71746-3_21

Part of the book series: Communications in Computer and Information Science (CCIS, volume 789)

Included in the following conference series:

Conference on Artificial Intelligence and Natural Language

Abstract

The paraphrase identification task has practical importance in the NLP community because of the need to deal with the pervasive problem of linguistic variation. Accurate methods should help improve the performance of NLP applications, including machine translation, information retrieval, question answering, text summarization, document clustering and plagiarism detection. We consider an approach to paraphrase identification that may be described as “knowledge-lean”. Our approach minimizes the need for data transformation and avoids the use of knowledge-based tools and resources. Candidate paraphrase pairs are represented using combinations of word- and character-based features. We show that SVM classifiers may be trained to distinguish paraphrase and non-paraphrase pairs across a number of different paraphrase corpora with good results. Analysis shows that features derived from character bigrams are particularly informative. We also describe recent experiments in identifying paraphrases in Russian, a language with rich morphology and free word order that presents a particularly interesting challenge for our knowledge-lean approach. We are able to report good results on a three-way paraphrase classification task.
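
As an illustration of what a knowledge-lean representation of a candidate pair might look like, the following minimal sketch builds set-overlap features from character bigrams and word unigrams. The abstract does not specify the exact feature set, so the four set-size features, the lowercasing, the whitespace tokenization, and all function names here are assumptions made for illustration only.

```python
def char_ngrams(text, n=2):
    """Return the set of character n-grams of a lowercased string."""
    s = text.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}


def word_ngrams(text, n=1):
    """Return the set of word n-grams, using plain whitespace tokenization."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_features(a, b):
    """Assumed feature set: sizes of the union, the intersection, and each set."""
    return [len(a | b), len(a & b), len(a), len(b)]


def pair_features(s1, s2):
    """Concatenate character-bigram and word-unigram overlap features."""
    return (overlap_features(char_ngrams(s1), char_ngrams(s2))
            + overlap_features(word_ngrams(s1), word_ngrams(s2)))


if __name__ == "__main__":
    print(pair_features("The cat sat on the mat", "A cat was sitting on the mat"))
```

Feature vectors of this kind need no stemmer, parser, or lexical resource, and could be passed to any SVM implementation to train a paraphrase/non-paraphrase classifier.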

Notes

1. TuPC can be downloaded from https://osf.io/wp83a/.
2. A debatable pair arises where the decisions of the four annotators are equally divided between “paraphrase” and “non-paraphrase”.
3. http://ainlconf.ru/paraphraser.
4. http://matplotlib.org/.

Acknowledgements

The authors gratefully acknowledge the comments of our reviewers on earlier drafts of this paper.

Author information

Authors and Affiliations

Bartin University, Bartin, 74100, Turkey
Asli Eyecioglu
University of Sussex, Brighton, BN19QJ, UK
Bill Keller

Corresponding author

Correspondence to Asli Eyecioglu.

Editor information

Editors and Affiliations

ITMO University, St. Petersburg, Russia
Andrey Filchenkov
University of Helsinki, Helsinki, Finland
Lidia Pivovarova
Mendel University, Brno, Czech Republic
Jan Žižka

Cite this paper

Eyecioglu, A., Keller, B. (2018). Knowledge-lean Paraphrase Identification Using Character-Based Features. In: Filchenkov, A., Pivovarova, L., Žižka, J. (eds) Artificial Intelligence and Natural Language. AINL 2017. Communications in Computer and Information Science, vol 789. Springer, Cham. https://doi.org/10.1007/978-3-319-71746-3_21

DOI: https://doi.org/10.1007/978-3-319-71746-3_21
Published: 28 November 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-71745-6
Online ISBN: 978-3-319-71746-3
eBook Packages: Computer Science, Computer Science (R0)