Integrating Approximate String Matching with Phonetic String Similarity

  • Conference paper
Advances in Databases and Information Systems (ADBIS 2018)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11019)

Abstract

Well-defined dictionaries of tagged entities are used in many tasks to identify entities when the scope is limited and machine learning is unnecessary. One common solution is to encode the input dictionary into a Trie and use it to find matches in an input text. However, a large dictionary and spelling errors in the input tokens both degrade such solutions. We present an approach that transforms the dictionary and each input token into a compact, well-known phonetic representation. The resulting dictionary is encoded in a Trie that is about 72% smaller than a non-phonetic Trie. We perform inexact matching over this representation to obtain an initial set of candidate results. Lastly, we apply a second similarity measure to select the best candidate with which to annotate a given entity. In our experiments, the approach achieved good F1 results. The solution was developed as an entity recognition plug-in for GATE, a well-known information extraction framework.
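
The pipeline outlined in the abstract (phonetic encoding, Trie lookup with inexact matching, then a second similarity filter) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name the phonetic encoding or the similarity measures, so `phonetic_key` is a toy stand-in (first letter kept, later vowels dropped, repeated letters collapsed), the inexact match is a plain Levenshtein bound evaluated while walking the Trie, and the second measure is Python's `difflib.SequenceMatcher` ratio. All function names and the thresholds `max_dist` and `min_ratio` are hypothetical.

```python
from difflib import SequenceMatcher

VOWELS = set("aeiou")


def phonetic_key(word):
    """Toy phonetic key (assumption, not the paper's encoding):
    keep the first letter, drop later vowels, collapse repeated letters."""
    letters = [c for c in word.lower() if c.isalpha()]
    key = []
    for i, c in enumerate(letters):
        if i > 0 and c in VOWELS:
            continue
        if key and key[-1] == c:
            continue
        key.append(c)
    return "".join(key)


class TrieNode:
    def __init__(self):
        self.children = {}
        self.entries = []  # dictionary entries whose phonetic key ends here


def build_trie(dictionary):
    """Index every dictionary entry under its phonetic key."""
    root = TrieNode()
    for entry in dictionary:
        node = root
        for ch in phonetic_key(entry):
            node = node.children.setdefault(ch, TrieNode())
        node.entries.append(entry)
    return root


def fuzzy_lookup(root, key, max_dist):
    """Return entries whose phonetic key is within max_dist edits of key.
    One Levenshtein row is computed per Trie node; a branch is pruned
    as soon as every cell in its row exceeds max_dist."""
    found = list(root.entries) if len(key) <= max_dist else []
    first_row = list(range(len(key) + 1))
    stack = [(child, ch, first_row) for ch, child in root.children.items()]
    while stack:
        node, ch, prev = stack.pop()
        row = [prev[0] + 1]
        for j in range(1, len(key) + 1):
            cost = 0 if key[j - 1] == ch else 1
            row.append(min(row[j - 1] + 1, prev[j] + 1, prev[j - 1] + cost))
        if row[-1] <= max_dist:
            found.extend(node.entries)
        if min(row) <= max_dist:
            stack.extend((c, k, row) for k, c in node.children.items())
    return found


def best_match(root, token, max_dist=1, min_ratio=0.7):
    """Stage 1: inexact match on phonetic keys. Stage 2: rank the
    candidates by similarity of the original spellings and keep the best."""
    candidates = fuzzy_lookup(root, phonetic_key(token), max_dist)
    scored = [
        (SequenceMatcher(None, token.lower(), c.lower()).ratio(), c)
        for c in candidates
    ]
    scored = [s for s in scored if s[0] >= min_ratio]
    return max(scored)[1] if scored else None


if __name__ == "__main__":
    trie = build_trie(["London", "Lisbon", "Mississippi"])
    print(best_match(trie, "Londn"))
    print(best_match(trie, "Tokyo"))
```

In this sketch the Trie stores phonetic keys rather than full spellings, so entries that share a key share a path. That is one plausible reason a phonetic Trie can be smaller than a non-phonetic one, although the 72% figure reported above depends on the authors' encoding and dictionary and is not reproduced here.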

Notes

  1. http://gitlab.c3sl.ufpr.br/faes/asm/tree/master.

  2. https://gitlab.c3sl.ufpr.br/faes/asm/tree/master.

  3. https://en.wikipedia.org/wiki/Wikipedia:Lists_of_common_misspellings.

  4. Implementation from the commons-codec-1.10.jar library, available at https://commons.apache.org/proper/commons-codec/download_codec.cgi.

  5. Implementation from lucene-suggest-5.2.1.jar, at http://lucene.apache.org/.

References

  1. Cunningham, H.: Information extraction, automatic. In: Encyclopedia of Language and Linguistics, 2nd edn. (2005)

  2. Cunningham, H., Tablan, V., Roberts, A., Bontcheva, K.: Getting more out of biomedical documents with GATE’s full lifecycle open source text analytics. PLOS Comput. Biol. 9(2), e1002854 (2013)

  3. Deng, D., Li, G., Wen, H., Jagadish, H.V., Feng, J.: META: an efficient matching-based method for error-tolerant autocompletion. Proc. VLDB Endow. 9(10), 828–839 (2016)

  4. Grishman, R., Sundheim, B.: Message understanding conference-6: a brief history. In: COLING, vol. 96, pp. 466–471 (1996)

  5. Ji, S., Li, G., Li, C., Feng, J.: Efficient interactive fuzzy keyword search. In: Proceedings of the 18th WWW, WWW 2009, Madrid, Spain, pp. 371–380. ACM (2009)

  6. Lamontagne, L., Abi-Zeid, I.: Combining multiple similarity metrics using a multicriteria approach. In: Roth-Berghofer, T.R., Göker, M.H., Güvenir, H.A. (eds.) ECCBR 2006. LNCS (LNAI), vol. 4106, pp. 415–428. Springer, Heidelberg (2006). https://doi.org/10.1007/11805816_31

  7. Li, G., Ji, S., Li, C., Feng, J.: Efficient type-ahead search on relational data: a tastier approach. In: Proceedings of the 2009 ACM SIGMOD, SIGMOD 2009, Providence, Rhode Island, USA, pp. 695–706. ACM (2009)

  8. Li, G., Ji, S., Li, C., Feng, J.: Efficient fuzzy full-text type-ahead search. VLDB J. 20(4), 617–640 (2011)

  9. Miller, G.A.: WordNet: a lexical database for English. Commun. ACM 38(11), 39–41 (1995)

  10. Navarro, G.: A guided tour to approximate string matching. ACM Comput. Surv. 33(1), 31–88 (2001)

  11. Philips, L.: Hanging on the metaphone. Comput. Lang. Mag. 7(12), 38–44 (1990)

  12. Sarawagi, S.: Information extraction. Found. Trends Databases 1(3), 261–377 (2008)

  13. Tissot, H., Peschl, G., Del Fabro, M.D.: Fast phonetic similarity search over large repositories. In: Decker, H., Lhotská, L., Link, S., Spies, M., Wagner, R.R. (eds.) DEXA 2014. LNCS, vol. 8645, pp. 74–81. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10085-2_6

  14. Culotta, A., Kristjansson, T., McCallum, A., Viola, P.: Corrective feedback and persistent learning for information extraction. Artif. Intell. 170(14), 1101–1122 (2006)

  15. Stonebraker, M., Tao, W., Deng, D.: Approximate string joins with abbreviations. Proc. VLDB Endow. 11(1), 53–65 (2017)

  16. Wimalasuriya, D.C., Dou, D.: Ontology-based information extraction: an introduction and a survey of current approaches. J. Inf. Sci. 36(3), 306–323 (2010)

  17. Xiao, C., Qin, J., Wang, W., Ishikawa, Y., Tsuda, K., Sadakane, K.: Efficient error-tolerant query autocompletion. Proc. VLDB Endow. 6(6), 373–384 (2013)

  18. Zobel, J., Dart, P.: Phonetic string matching: lessons from information retrieval. In: The 19th SIGIR, SIGIR 1996, Zurich, Switzerland, pp. 166–172. ACM (1996)

Acknowledgments

This work was partially funded by Project Sistema de Monitoramento de Políticas de Promoção da Igualdade Racial (SNPPIR).

Author information

Corresponding author

Correspondence to Marcos Didonet Del Fabro.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Ferri, J., Tissot, H., Del Fabro, M.D. (2018). Integrating Approximate String Matching with Phonetic String Similarity. In: Benczúr, A., Thalheim, B., Horváth, T. (eds) Advances in Databases and Information Systems. ADBIS 2018. Lecture Notes in Computer Science, vol 11019. Springer, Cham. https://doi.org/10.1007/978-3-319-98398-1_12

  • DOI: https://doi.org/10.1007/978-3-319-98398-1_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-98397-4

  • Online ISBN: 978-3-319-98398-1

  • eBook Packages: Computer Science, Computer Science (R0)
