
Fast Phonetic Similarity Search over Large Repositories

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8645)

Abstract

Analysis of unstructured data may be inefficient in the presence of spelling errors. Existing approaches use string similarity methods, supported by a dictionary, to search for valid words within a text. However, these methods are not rich enough to encode phonetic information that could assist the search. In this paper, we present a novel approach for efficiently performing phonetic similarity search over large data sources. It uses a data structure called PhoneticMap to encode language-specific phonetic information. We validate our approach through an experiment that automatically corrects misspelled words, over a data set built from a Portuguese variant of a well-known repository.
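
The abstract does not describe the internals of PhoneticMap, so the following is only a rough sketch of the general idea of phonetic similarity search, not the authors' method. It assumes a toy phonetic key (the letter classes in `_CLASSES` are invented for illustration), indexes dictionary words by that key, and ranks the words that share the query's key by edit distance. All names here (`phonetic_key`, `build_index`, `suggest`) are hypothetical.

```python
from collections import defaultdict

# Hypothetical letter classes: letters that often sound alike share a code.
# These groupings are illustrative only, not the paper's phonetic rules.
_CLASSES = {
    "b": "1", "p": "1", "f": "1", "v": "1",
    "c": "2", "k": "2", "q": "2", "g": "2", "j": "2",
    "s": "3", "z": "3", "x": "3",
    "d": "4", "t": "4",
    "l": "5", "r": "5",
    "m": "6", "n": "6",
}


def phonetic_key(word: str) -> str:
    """Return a coarse key: consonant class codes with repeated codes collapsed."""
    key = []
    for ch in word.lower():
        code = _CLASSES.get(ch)
        if code is None:  # vowels and any unmapped character are skipped
            continue
        if not key or key[-1] != code:
            key.append(code)
    return "".join(key)


def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution or match
        prev = curr
    return prev[-1]


def build_index(words):
    """Group dictionary words by their phonetic key."""
    index = defaultdict(list)
    for word in words:
        index[phonetic_key(word)].append(word)
    return index


def suggest(index, query, limit=3):
    """Return up to `limit` words sharing the query's key, closest first."""
    candidates = index.get(phonetic_key(query), [])
    return sorted(candidates, key=lambda w: (edit_distance(query, w), w))[:limit]


dictionary = ["casa", "gato", "pato", "mesa", "massa"]
index = build_index(dictionary)
print(suggest(index, "meza"))  # ['mesa', 'massa']
```

Looking up one key restricts the edit-distance comparison to a small bucket instead of the whole dictionary, which is the kind of saving a phonetic index aims for. A misspelling that changes the key is missed by this sketch.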




Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Tissot, H., Peschl, G., Del Fabro, M.D. (2014). Fast Phonetic Similarity Search over Large Repositories. In: Decker, H., Lhotská, L., Link, S., Spies, M., Wagner, R.R. (eds) Database and Expert Systems Applications. DEXA 2014. Lecture Notes in Computer Science, vol 8645. Springer, Cham. https://doi.org/10.1007/978-3-319-10085-2_6


  • DOI: https://doi.org/10.1007/978-3-319-10085-2_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-10084-5

  • Online ISBN: 978-3-319-10085-2

  • eBook Packages: Computer Science, Computer Science (R0)
